SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Frederic Sala, Aws Albarghouthi

Abstract

Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.
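To make the abstract's two quality signals concrete, the following is a minimal illustrative sketch. The paper's exact metric definitions are not reproduced here: the cyclomatic-complexity approximation, the high-complexity cutoff, and the shingle-based duplication proxy below are all assumptions made for illustration, not the benchmark's implementation.

```python
"""Illustrative sketch of the two trajectory-level signals named in the
abstract. COMPLEXITY_CUTOFF and SHINGLE are assumed values, chosen only
to make the sketch runnable; they are not taken from the paper."""
import ast
import hashlib

COMPLEXITY_CUTOFF = 10  # assumed threshold for a "high-complexity" function
SHINGLE = 6             # assumed window size for duplicate-line detection


def cyclomatic(fn: ast.AST) -> int:
    """Rough cyclomatic complexity: 1 + number of branching nodes."""
    branches = (ast.If, ast.For, ast.While, ast.And, ast.Or,
                ast.ExceptHandler, ast.IfExp)
    return 1 + sum(isinstance(n, branches) for n in ast.walk(fn))


def structural_erosion(source: str) -> float:
    """Share of total complexity mass held by high-complexity functions."""
    fns = [n for n in ast.walk(ast.parse(source))
           if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    masses = [cyclomatic(f) for f in fns]
    total = sum(masses)
    high = sum(m for m in masses if m > COMPLEXITY_CUTOFF)
    return high / total if total else 0.0


def verbosity(source: str) -> float:
    """Fraction of non-blank lines inside a repeated SHINGLE-line window,
    a crude proxy for redundant or duplicated code."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    seen, dup_idx = {}, set()
    for i in range(len(lines) - SHINGLE + 1):
        key = hashlib.sha1("\n".join(lines[i:i + SHINGLE]).encode()).hexdigest()
        if key in seen:  # this window repeats an earlier one
            dup_idx.update(range(i, i + SHINGLE))
            dup_idx.update(range(seen[key], seen[key] + SHINGLE))
        else:
            seen[key] = i
    return len(dup_idx) / len(lines) if lines else 0.0
```

Both functions map a source file to a value in [0, 1], so they can be averaged across a trajectory's checkpoints and compared between agent runs and human repositories, as the abstract describes.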

Figures (6)

  • Figure 1: Iterative evaluation in SlopCodeBench. At each checkpoint, the agent receives an updated specification and extends its own prior solution. File-level diffs grow across checkpoints, reflecting accumulated code changes.
  • Figure 2: Solve rates and cost growth over problem progress. Left: Agents pass all core tests 1.4--13.3$\times$ more often than the full checkpoint suite, and strict solve rates, which include regressions, collapse to 0.5% by the final checkpoint. Right: Mean cost per checkpoint grows 2.9$\times$ from the first to the last progress bin. See \ref{sec:test-pass-rates} for a breakdown of continuous pass rates by test type.
  • Figure 3: Erosion and verbosity across problem progress for six representative models (three per provider). Both metrics increase monotonically.
  • Figure 4: Mean verbosity and structural erosion across normalized trajectory progress for agent runs and human repositories. Shaded regions show 95% confidence intervals. Agent metrics climb monotonically; human metrics plateau.
  • Figure 5: Prompt strategy trajectories across two models. Each point shows the mean value at a normalized progress bin. Quality-aware prompts (Anti-Slop and Plan-First) lower the initial verbosity and erosion compared to the Baseline (just-solve), but the trajectories remain largely parallel across progress. Slopes ($m$) indicate the mean degradation per bin. Despite significant gains in structural quality, pass rates (column 3) do not improve consistently on either model ($p > 0.05$ via paired Wilcoxon signed-rank tests; a sketch of this test follows the figure list). Quality-aware prompting also increases per-checkpoint cost (column 4).
  • ...and 1 more figure
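
For readers unfamiliar with the test named in the Figure 5 caption, here is a minimal sketch of a paired Wilcoxon signed-rank comparison using SciPy. The pass-rate numbers are invented for demonstration and are not the paper's data; only the SciPy function itself is real.

```python
# Hypothetical illustration of the paired comparison reported in Figure 5.
# The per-checkpoint pass rates below are invented placeholders.
from scipy.stats import wilcoxon

baseline_pass = [0.42, 0.31, 0.18, 0.12, 0.07]  # Baseline prompt, per checkpoint
antislop_pass = [0.45, 0.30, 0.20, 0.11, 0.08]  # Anti-Slop prompt, same checkpoints

# Paired signed-rank test on the per-checkpoint differences.
stat, p = wilcoxon(baseline_pass, antislop_pass)
print(f"W={stat:.1f}, p={p:.3f}")  # p > 0.05 -> no consistent pass-rate change
```

Pairing by checkpoint is what makes the test appropriate here: each prompt strategy is evaluated on the same problems, so the test operates on within-checkpoint differences rather than on two independent samples.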