The Specification Gap: Coordination Failure Under Partial Knowledge in Code Agents

Camilo Chacón Sartori

Abstract

When multiple LLM-based code agents independently implement parts of the same class, they must agree on shared internal representations, even when the specification leaves those choices implicit. We study this coordination problem across 51 class-generation tasks, progressively stripping specification detail from full docstrings (L0) to bare signatures (L3), and introducing opposing structural biases (lists vs. dictionaries) to stress-test integration. Three findings emerge. First, a persistent specification gap: two-agent integration accuracy drops from 58% to 25% as detail is removed, while a single-agent baseline degrades more gracefully (89% to 56%), leaving a 25–39 pp coordination gap that is consistent across two Claude models (Sonnet, Haiku) and three independent runs. Second, an AST-based conflict detector achieves 97% precision at the weakest specification level without additional LLM calls, yet a factorial recovery experiment shows that restoring the full specification alone recovers the single-agent ceiling (89%), while providing conflict reports adds no measurable benefit. Third, decomposing the gap into coordination cost (+16 pp) and information asymmetry (+11 pp) suggests that the two effects are independent and approximately additive. The gap is not merely a consequence of hidden information, but reflects the difficulty of producing compatible code without shared decisions. These results support a specification-first view of multi-agent code generation: richer specifications are both the primary coordination mechanism and the sufficient recovery instrument.
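As a quick check on the headline numbers, the sketch below computes the single-agent minus two-agent gap at the two specification levels quoted in the abstract. The dictionary layout and the function name `coordination_gap` are illustrative choices, not taken from the paper, and only the L0 and L3 rates are used because the abstract does not state the intermediate levels.

```python
# Pass rates (%) quoted in the abstract; L1 and L2 are not given there.
single_agent = {"L0": 89, "L3": 56}
two_agent = {"L0": 58, "L3": 25}


def coordination_gap(single_rates, split_rates):
    """Return the gap in percentage points (single minus two-agent) per level."""
    return {level: single_rates[level] - split_rates[level]
            for level in single_rates}


gaps = coordination_gap(single_agent, two_agent)
print(gaps)  # {'L0': 31, 'L3': 31}
```

Both headline gaps come out at 31 pp, which sits inside the 25–39 pp range reported across models and runs.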

Paper Structure

This paper contains 37 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The specification gap and its resolution. Top: A single agent with the full specification achieves 89% pass rate. Middle: Two biased agents with bare specifications produce conflicting code (25%). Bottom: Restoring the full specification to a merger agent recovers 89%, matching the single-agent ceiling, while conflict reports add nothing. The specification is both the cause of failure and the sufficient instrument of recovery.
  • Figure 2: Experimental overview. Each task's class skeleton is processed at specification level $\ell \in \{\text{L0}, \text{L1}, \text{L2}, \text{L3}\}$. The single condition (top) gives one unbiased agent the full skeleton including __init__. The split condition (bottom) hides __init__ and assigns methods to two biased agents whose outputs are merged. An AST-based detector (dashed) checks for structural conflicts before integration.
  • Figure 3: Test pass rates degrade monotonically as specification detail is removed. The shaded band highlights the persistent coordination gap (25–39 pp across models) between single-agent and split-agent conditions. The dashed line marks the transition where explicit data-structure references are stripped (L1 $\to$ L2).
  • Figure 4: AST conflict detector recall and precision by specification level. Precision improves dramatically as specifications degrade, reaching 96.7% at L3: the detector is most reliable precisely where it is most needed.
  • Figure 5: Recovery experiment results ($n=53$ tasks). (a) Six conditions showing Spec-Only (88.9%) matches Single (88.3%), while Blind and Guided are identical (52.7%): conflict information has no effect without the full specification. (b) Interaction plot: the L0 specification accounts for +36 pp recovery; conflict reports contribute $\Delta = 0$ pp at L3 and $-6.6$ pp at L0. Dashed line: single-agent baseline.
  • ...and 4 more figures
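
The captions for Figures 2 and 4 refer to an AST-based detector that flags structural conflicts between the two agents' outputs without extra LLM calls. The paper's actual detector is not shown here, so the following is only a minimal sketch of the general idea, assuming a conflict means that one agent treats a `self.<attr>` field as a list while the other treats it as a dictionary. The method sets, the helper names (`infer_usage`, `detect_conflicts`), and the two example snippets are all hypothetical.

```python
import ast
import textwrap

# Assumed heuristics: method names that suggest a list or a dict container.
LIST_METHODS = {"append", "extend", "insert", "sort"}
DICT_METHODS = {"items", "keys", "values", "get", "setdefault"}


def _self_attr(node):
    """Return the name if node is `self.<name>`, otherwise None."""
    if (isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "self"):
        return node.attr
    return None


def infer_usage(source):
    """Map each `self.<attr>` to the container kinds its usage suggests."""
    usage = {}
    for node in ast.walk(ast.parse(textwrap.dedent(source))):
        # Method calls such as self.items.append(x) or self.items.get(k).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            attr = _self_attr(node.func.value)
            if attr is not None and node.func.attr in LIST_METHODS:
                usage.setdefault(attr, set()).add("list")
            elif attr is not None and node.func.attr in DICT_METHODS:
                usage.setdefault(attr, set()).add("dict")
        # Literal assignments such as self.items = [] or self.items = {}.
        if isinstance(node, ast.Assign):
            for target in node.targets:
                attr = _self_attr(target)
                if attr is not None and isinstance(node.value, ast.List):
                    usage.setdefault(attr, set()).add("list")
                elif attr is not None and isinstance(node.value, ast.Dict):
                    usage.setdefault(attr, set()).add("dict")
    return usage


def detect_conflicts(source_a, source_b):
    """List attributes whose inferred kinds share nothing between the agents."""
    usage_a, usage_b = infer_usage(source_a), infer_usage(source_b)
    return [
        (attr, sorted(usage_a[attr]), sorted(usage_b[attr]))
        for attr in sorted(usage_a.keys() & usage_b.keys())
        if usage_a[attr].isdisjoint(usage_b[attr])
    ]


# Hypothetical outputs from a list-biased and a dict-biased agent.
AGENT_A = """
def add_item(self, name, price):
    self.items.append((name, price))
"""
AGENT_B = """
def get_price(self, name):
    return self.items.get(name)
"""

conflicts = detect_conflicts(AGENT_A, AGENT_B)
print(conflicts)  # [('items', ['list'], ['dict'])]
```

Because the check is purely syntactic, it runs without executing either agent's code. It will miss conflicts that only surface through indirect use, such as an attribute passed to a helper function.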