MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization
Atharva Naik, Lawanya Baghel, Dhakshin Govindarajan, Darsh Agrawal, Daniel Fried, Carolyn Rose
TL;DR
MetaLint reframes code-quality analysis as idiom-based meta-tasks guided by high-level specifications, enabling LLMs to generalize from easy, linter-detectable violations to harder, context-dependent best practices. It combines synthetic linter-derived data, instruction-following fine-tuning, a verifiable reward model with RS-DPO, and reasoning-trace training, so that a trained model can handle unseen best practices at test time without retraining. Across Python and Java, and across multiple reasoning settings, MetaLint achieves strong detection recall and competitive localization, matching or approaching much larger models on hard PEP benchmarks. The work demonstrates that organizing supervision around semantic code idioms can improve robustness and transfer as code-quality practices evolve, with practical implications for configurable, scalable code analysis tooling.
Abstract
Large Language Models excel at code generation but struggle with code quality analysis, where best practices evolve and cannot be fully captured by static training data. We introduce MetaLint, a training framework that treats code quality analysis as detecting best practice violations from high-level specifications over semantic code fragments (code idioms). Instead of training on a fixed set of rules, MetaLint reorganizes supervision around dynamically specified best practices using synthetic linter-derived labels, integrated with instruction-following and preference optimization. This encourages extrapolation to more complex, unseen best practices at test time, consistent with easy-to-hard generalization without retraining. To evaluate MetaLint, we create a new benchmark of hard-to-detect best practices inspired by Python Enhancement Proposals. On this benchmark, MetaLint improves generalization to unseen best practices. With MetaLint training, Qwen3-4B achieves a 2.7x detection F-score gain (from 25.9% to 70.4%), the highest recall, and a 26.7% localization F-score, matching larger models such as o3-mini. These gains generalize across programming languages, model families, scales, reasoning settings, and linter sources.
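To make the idea of linter-derived supervision over a single idiom more concrete, the following is a minimal sketch of how one such training example might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the AST check stands in for a real linter, the idiom (a bare `except:` clause, which PEP 8 advises against) is chosen only for simplicity, and the function names, prompt wording, and JSON output format are all hypothetical.

```python
import ast
import json


def find_bare_excepts(source: str) -> list[int]:
    """Stand-in for a linter: return line numbers of bare `except:` clauses."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    )


def build_meta_task(idiom_name: str, idiom_spec: str, source: str,
                    flagged_lines: list[int]) -> dict:
    """Pair a high-level idiom specification with code and linter-derived labels.

    The record layout (idiom / prompt / target) is hypothetical.
    """
    src_lines = source.splitlines()
    violations = [
        {"line": n, "code": src_lines[n - 1].strip()} for n in flagged_lines
    ]
    prompt = (
        f"Idiom: {idiom_name}\n"
        f"Specification: {idiom_spec}\n\n"
        f"Code:\n{source}\n"
        "List every line that violates the idiom."
    )
    # Files with no flagged lines become negative examples for this idiom.
    target = json.dumps(violations) if violations else "NO VIOLATIONS FOUND"
    return {"idiom": idiom_name, "prompt": prompt, "target": target}


SAMPLE = "try:\n    risky()\nexcept:\n    pass\n"
CLEAN = "x = 1\n"

SPEC = "Exception handlers should name the exception types they catch."

example = build_meta_task(
    "bare-except", SPEC, SAMPLE, find_bare_excepts(SAMPLE)
)
negative = build_meta_task(
    "bare-except", SPEC, CLEAN, find_bare_excepts(CLEAN)
)

print(example["target"])
print(negative["target"])
```

Because the specification travels in the prompt rather than being baked into the label set, the same record layout could in principle carry a best practice that no linter detects, which is the setting the benchmark described above is meant to probe.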
