Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits

Haichao Zhu, Qian Zhang, Jiyuan Wang, Zhaorui Yang, Yuxin Qiu

Abstract

AI coding agents can now complete complex programming tasks, but existing evaluations largely emphasize behavioral correctness and often overlook maintainability risks such as weak modularity or testability. We present Needle in the Repo (NITR), a diagnostic probe-and-oracle framework for evaluating whether behaviorally correct repository edits preserve maintainable structure. NITR distills recurring software engineering wisdom into controlled probes embedded in small, realistic multi-file codebases, each designed so that success depends primarily on one targeted maintainability dimension. Each probe is paired with a hidden evaluation harness that combines functional tests for required behavior with structural oracles that encode the targeted maintainability constraint and return interpretable diagnoses. Using NITR, we evaluate 23 coding configurations across GPT, Claude, Gemini, and Qwen families in both direct-inference and agent-based settings. Current AI coding systems remain far from robust: on average, configurations solve only 36.2% of cases, the best reaches 57.1%, and performance drops from 53.5% on micro cases to 20.6% on multi-step cases. The hardest pressures demand architectural restructuring rather than local edits, especially dependency control (4.3%) and responsibility decomposition (15.2%). Moreover, 64/483 outcomes (13.3%) pass all functional tests yet fail the structural oracle. Under our harness, agent-mode configurations improve average performance from 28.2% to 45.0%, but do not eliminate these architectural failures. These results show that progress in code generation is not yet progress in maintainable code evolution, and that NITR exposes a critical failure surface missed by conventional evaluation.
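
The harness's pass criterion is conjunctive: an edit counts as solved only if it passes both the functional tests and the structural oracle, which is exactly how the 64 functional-pass/structural-fail outcomes are surfaced. The C++ sketch below illustrates this verdict logic under assumed names (OracleResult, ProbeVerdict, evaluate_probe); it is our illustration of the abstract's description, not the paper's actual harness API.

    #include <functional>
    #include <string>

    // Hypothetical sketch of NITR's pass/fail semantics; all names here are
    // our assumptions, not the paper's actual harness API.
    struct OracleResult {
        bool ok;                // did the edit satisfy the structural constraint?
        std::string diagnosis;  // interpretable explanation when it did not
    };

    struct ProbeVerdict {
        bool functional_pass;   // all hidden functional tests pass
        bool structural_pass;   // the targeted structural oracle passes
        std::string diagnosis;  // non-empty only on structural failure
        bool solved() const { return functional_pass && structural_pass; }
    };

    ProbeVerdict evaluate_probe(const std::function<bool()>& functional_tests,
                                const std::function<OracleResult()>& structural_oracle) {
        ProbeVerdict verdict{functional_tests(), false, ""};
        OracleResult oracle = structural_oracle();
        verdict.structural_pass = oracle.ok;
        if (!oracle.ok) verdict.diagnosis = oracle.diagnosis;
        return verdict;
    }

Returning a diagnosis string alongside the boolean is what makes a structural failure interpretable, rather than an opaque test failure.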

Paper Structure

This paper contains 19 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Existing evaluations rank models by behavioral success, but leaderboard performance says little about maintainability risk. NITR uses curated probes and structural oracles to expose failure modes beyond test passing.
  • Figure 2: Two implementations of a multi-type add function. Both pass all unit tests, but the overloaded version (top) fails the structural probe because it introduces a new definition for each type added, amplifying change across the codebase. The template version (bottom) passes both unit tests and the structural probe (a minimal sketch of this contrast follows the figure list).
  • Figure 3: From maintainability pressure to diagnostic probe. NITR summarizes SE practices into 9 recurring repository-evolution pressures, instantiates each target pressure as a compact probe with starter code and an agent-facing task, and pairs the probe with functional and structural oracles.
  • Figure 4: Anatomy of a maintainability probe. Each probe contains starter code, an agent-visible task (TASK.md), an author-facing specification (SPEC.md), and a hidden evaluator. During evaluation, the model sees only the starter code and TASK.md; the probe passes only if the generated edits satisfy both the functional tests and the structural probes.
  • Figure 5: Multi-step probe execution. Each step is applied to the codebase produced by the previous step, so early design choices persist across later changes. The final codebase is then evaluated with tests and structural oracles.
  • ...and 3 more figures
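
For concreteness, the contrast Figure 2 describes can be reconstructed in a few lines of C++; this is our sketch of the idea, not the paper's actual starter code or probe implementation.

    #include <string>

    // Overload-style version (our reconstruction): each newly supported type
    // requires another definition, so adding a type amplifies change across
    // the codebase. This is the shape of edit Figure 2's structural probe rejects.
    int add(int a, int b) { return a + b; }
    double add(double a, double b) { return a + b; }
    std::string add(const std::string& a, const std::string& b) { return a + b; }

    // Template-style version (our reconstruction): one definition covers every
    // type that supports operator+, so a new type needs no new code. Both
    // versions pass the unit tests; only this one also passes the structural probe.
    template <typename T>
    T add(const T& a, const T& b) { return a + b; }

Both variants return identical results, so unit tests cannot tell them apart; the structural probe instead flags the per-type definitions, which amplify change across the codebase as new types are added.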