MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization

Atharva Naik, Lawanya Baghel, Dhakshin Govindarajan, Darsh Agrawal, Daniel Fried, Carolyn Rose

TL;DR

MetaLint reframes code-quality analysis as idiom-based meta-tasks guided by high-level specifications, enabling LLMs to generalize from easy, linter-detectable violations to harder, context-dependent best practices. It combines synthetic linter-derived data, instruction-following fine-tuning, a verifiable reward model with RS-DPO, and reasoning-trace training to promote easy-to-hard generalization without retraining. Across Python and Java and multiple reasoning settings, MetaLint achieves strong detection recall and competitive localization, matching or approaching much larger models on hard PEP benchmarks. The work demonstrates that organizing supervision around semantic code idioms can improve robustness and transfer in evolving code-quality practices, with practical implications for configurable, scalable code analysis tooling.

Abstract

Large Language Models excel at code generation but struggle with code quality analysis, where best practices evolve and cannot be fully captured by static training data. We introduce MetaLint, a training framework that treats code quality analysis as detecting best practice violations from high-level specifications over semantic code fragments (code idioms). Instead of training on a fixed set of rules, MetaLint reorganizes supervision around dynamically specified best practices using synthetic linter-derived labels, integrated with instruction-following and preference optimization. This encourages extrapolation to more complex, unseen best practices at test time, consistent with easy-to-hard generalization without retraining. To evaluate MetaLint, we create a new benchmark of hard-to-detect best practices inspired by Python Enhancement Proposals. Across this benchmark, MetaLint improves generalization to unseen best practices. Qwen3-4B achieves a 2.7x detection F-score gain (25.9% -> 70.4%), the highest recall, and a 26.7% localization F-score, matching larger models such as o3-mini. These gains generalize across programming languages, model families, scales, reasoning settings, and linter sources.
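To make the two reported metrics concrete, the sketch below scores a model's predicted violation lines against linter-flagged lines. It is an illustration only: the function names, the set-of-line-numbers representation, and the choice to treat detection as a file-level yes/no agreement are assumptions, since this excerpt does not spell out the paper's exact metric or reward definitions.

```python
def f_score(predicted: set, gold: set) -> float:
    """F1 between two sets; two empty sets count as perfect agreement."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def linter_reward(predicted_lines: set, linter_lines: set) -> dict:
    """Hypothetical verifiable reward: compare model output to linter labels.

    Assumption: 'detection' checks only whether the model and the linter
    agree that the file contains any violation; 'localization' is the F1
    over the flagged line numbers.
    """
    detection = 1.0 if bool(predicted_lines) == bool(linter_lines) else 0.0
    localization = f_score(predicted_lines, linter_lines)
    return {"detection": detection, "localization": localization}


if __name__ == "__main__":
    # The model flags lines 3 and 7; the linter flagged lines 3 and 9.
    print(linter_reward({3, 7}, {3, 9}))
```

Because the score is computed directly from linter output, it can be checked without a learned judge, which is the sense in which such a reward is verifiable.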

Paper Structure

This paper contains 38 sections, 6 equations, 4 figures, and 32 tables.

Figures (4)

  • Figure 1: MetaLint: (1) Synthetic data generation with linters/tools, (2) Supervised Instruction Fine-Tuning (SFT) on this data, and (3) Verifiable Reward Model derived from the linter.
  • Figure 2: MetaLint: Preference Optimization using reward model: (4) Rejection Sampling Direct Preference Optimization (RS-DPO), and (5) Rejection Sampling Supervised Fine-Tuning (RS-SFT).
  • Figure 3: ID: In-Domain, NeT: Near Transfer, FaT: Far Transfer.
  • Figure 4: Distribution of comparative failures of the CoT MetaLint Qwen3-4B model relative to its non-CoT variant. While errors span a long tail across many PEPs, the majority are concentrated in three: PEP614, PEP593, and PEP616, which motivates our focused analysis on these cases.
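The preference-optimization step named in Figure 2 can be pictured roughly as follows: sample several responses per prompt, score each with the reward, and keep (chosen, rejected) pairs whose reward gap is large enough. This is a minimal sketch of rejection-sampling pair construction in general, not the authors' implementation; `build_preference_pairs`, `reward_fn`, and the `min_gap` threshold are hypothetical names and values introduced here for illustration.

```python
from itertools import combinations


def build_preference_pairs(samples, reward_fn, min_gap=0.2):
    """Form DPO-style preference pairs from scored samples.

    samples   : candidate responses for a single prompt
    reward_fn : callable mapping a response to a scalar reward
                (assumed here to be a linter-derived score)
    min_gap   : hypothetical threshold; pairs with a smaller reward
                difference are discarded as uninformative
    """
    scored = [(reward_fn(sample), sample) for sample in samples]
    pairs = []
    for (reward_a, a), (reward_b, b) in combinations(scored, 2):
        if abs(reward_a - reward_b) >= min_gap:
            chosen, rejected = (a, b) if reward_a > reward_b else (b, a)
            pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


if __name__ == "__main__":
    # Toy reward: longer strings score higher (stand-in for a real reward).
    toy_reward = lambda response: len(response) / 3
    print(build_preference_pairs(["a", "bb", "ccc"], toy_reward, min_gap=0.5))
```

Keeping only the highest-reward samples instead of forming pairs would give the supervised variant (RS-SFT) that the same figure mentions.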