When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

Lu Yan, Xuan Chen, Xiangyu Zhang

Abstract

Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through interaction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulneSs Loss Under eMergent sPecification (SLUMP), defined as the reduction in final implementation faithfulness under emergent specification relative to a single-shot specification control. The benchmark contains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable components, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure audit to verify that scored components are recoverable from the visible interaction. Evaluated on Claude Code and Codex, the single-shot specification control achieves higher overall implementation fidelity on 16/20 and 14/20 papers, respectively. Structural integration degrades under emergent specification on both platforms, while semantic faithfulness loss is substantial on Claude Code and small on Codex. As a mitigation case study, we introduce ProjectGuard, an external project-state layer for specification tracking. On Claude Code, ProjectGuard recovers 90% of the faithfulness gap, increases fully faithful components from 118 to 181, and reduces severe failures from 72 to 49. These results identify specification tracking as a distinct evaluation target for long-horizon coding agents.
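
The definition of SLUMP above, and the notion of "recovering" part of the faithfulness gap, can be illustrated with a minimal sketch. It assumes faithfulness is summarized as one scalar per condition; the function names and the numbers in the usage lines are hypothetical placeholders, not values or code from the paper.

```python
def slump(single_shot: float, emergent: float) -> float:
    """Faithfulness loss: single-shot control score minus emergent-spec score.

    Assumes both scores are on the same scale (hypothetical scalar summary).
    """
    return single_shot - emergent


def gap_recovered(single_shot: float, emergent: float, mitigated: float) -> float:
    """Fraction of the faithfulness gap closed by a mitigation.

    0.0 means no improvement over the emergent condition;
    1.0 means the mitigated run matches the single-shot control.
    """
    gap = slump(single_shot, emergent)
    if gap <= 0:
        # No loss to recover, so the ratio is undefined.
        raise ValueError("no positive faithfulness gap to recover")
    return (mitigated - emergent) / gap


# Placeholder scores for illustration only (not taken from the paper).
example_gap = slump(single_shot=0.8, emergent=0.6)
example_recovery = gap_recovered(single_shot=0.8, emergent=0.6, mitigated=0.75)
```

Under this reading, a mitigation that "recovers 90% of the faithfulness gap" corresponds to `gap_recovered(...)` returning about 0.9 for the scores being compared.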

Paper Structure (53 sections, 7 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Cumulative exposure over turns. Left: fraction of components explicitly specified ($R{=}4$). Right: fraction recoverable ($R{\ge}1$). Each line is one paper. The gap between panels reflects the ambiguous middle where components are inferable but not yet fully specified.
  • Figure 2: Per-paper IF50 under emergent specification and the single-shot specification control.
  • Figure 3: Test pass rate fails to detect SLUMP.
  • Figure 4: Default long-horizon coding workflow. User requests, tool outputs, and code edits accumulate in a single live conversation, with older context periodically condensed into a compact summary. This representation preserves recent interaction state but provides only a weak project-level view of earlier design commitments and repository structure, making specification tracking under emergent specification difficult.
  • Figure 5: ProjectGuard overview. Before each coding turn, the system combines the request history with the current repository skeleton to maintain two external views of the project: a semantic state of committed but revisable project knowledge, and a structural state of files, symbols, and interface relations. A forecaster then prepares a project-aware brief for the coding agent, highlighting relevant design commitments, compatibility constraints, and candidate modules for revision or reuse. When the remaining live context is unlikely to support the upcoming request, ProjectGuard can trigger a proactive restart and inject the rendered project state into a fresh session. After the turn completes, both semantic and structural state are updated from the new interaction and repository changes.
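
The cumulative exposure curves described in the Figure 1 caption can be sketched as follows. This is an illustrative reading, not the authors' audit code: it assumes each turn assigns some components an integer exposure rating $R$, that a component's cumulative exposure is the highest rating seen so far, and that $R{=}4$ is the top of the scale. The function name and input layout are hypothetical.

```python
from typing import Dict, List


def cumulative_exposure(
    ratings_by_turn: List[Dict[str, int]],
    components: List[str],
    threshold: int,
) -> List[float]:
    """Fraction of components whose best-so-far rating R meets the threshold.

    ratings_by_turn[t] maps component id -> exposure rating observed at turn t
    (assumed layout). Components never mentioned keep rating 0.
    """
    best = {c: 0 for c in components}  # highest rating seen so far
    fractions = []
    for turn_ratings in ratings_by_turn:
        for comp, rating in turn_ratings.items():
            if comp in best:
                best[comp] = max(best[comp], rating)
        exposed = sum(1 for r in best.values() if r >= threshold)
        fractions.append(exposed / len(components))
    return fractions


# Toy script with four components and four turns (made-up ratings).
components = ["a", "b", "c", "d"]
turns = [{"a": 1}, {"a": 4, "b": 2}, {}, {"c": 4}]
explicit = cumulative_exposure(turns, components, threshold=4)     # left panel
recoverable = cumulative_exposure(turns, components, threshold=1)  # right panel
```

In this toy script, the difference between `recoverable` and `explicit` at each turn corresponds to the "ambiguous middle" the caption mentions: components that are inferable but not yet fully specified.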