
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

Tural Mehtiyev, Wesley Assunção

Abstract

Coding agents represent a new paradigm in automated software engineering, combining the reasoning capabilities of Large Language Models (LLMs) with tool-augmented interaction loops. However, coding agents still have severe limitations: top-ranked LLM-based coding agents fail on over 20% of benchmarked problems. Yet we lack a systematic understanding of why agents fail (i.e., the causes) and how failure unfolds behaviorally. We present a large-scale empirical study analyzing 9,374 trajectories from 19 agents (8 coding agent frameworks, 14 LLMs) on 500 tasks. We organize our analysis around three research questions. First, we investigate why agents fail on specific tasks and find that patch complexity alone does not explain difficulty: 12 never-solved tasks require only simple patches and were rated easy by human annotators, yet all agents fail on them due to gaps in architectural reasoning and domain knowledge. Second, we examine how behavioral patterns differentiate success from failure. The widely reported correlation between trajectory length and failure reverses direction once task difficulty is controlled for, revealing it as a confound. Instead, trajectory structure discriminates consistently: agents that gather context before editing and invest in validation succeed more often, and these strategies are agent-determined rather than task-adaptive. Third, we disentangle LLM capability from framework design and find that the LLM is the primary driver of both outcome and behavior: agents sharing the same LLM agree on far more tasks than agents sharing the same framework, and the framework performance gap shrinks with each generation of LLM improvement. Framework prompts do influence agent tactics, but this influence diminishes with stronger LLMs.
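
The length--failure reversal described in the abstract is a Simpson's-paradox-style confound: harder tasks induce both longer trajectories and more failures, so pooling across difficulties makes length look like a failure signal. A minimal sketch of the effect, using synthetic data invented here (not the paper's trajectories) and Python with NumPy/SciPy as an assumed toolchain:

```python
# Synthetic illustration (not the paper's data or analysis code) of the confound:
# pooled across difficulty levels, longer trajectories correlate with failure,
# yet within every difficulty stratum longer trajectories fail LESS often.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

rows = []  # (difficulty, trajectory_length, failed)
for difficulty, base_len, base_fail in [(0, 20, 0.2), (1, 40, 0.5), (2, 60, 0.8)]:
    for _ in range(300):
        length = base_len + rng.normal(0, 8)
        # Within a stratum, extra steps (context gathering, validation)
        # slightly reduce the failure probability.
        p_fail = float(np.clip(base_fail - 0.01 * (length - base_len), 0.0, 1.0))
        rows.append((difficulty, length, rng.random() < p_fail))

diff = np.array([r[0] for r in rows])
length = np.array([r[1] for r in rows])
failed = np.array([r[2] for r in rows])

rho, _ = spearmanr(length, failed)
print(f"pooled    rho(length, failure) = {rho:+.2f}")  # positive: confounded

for d in (0, 1, 2):
    m = diff == d
    rho, _ = spearmanr(length[m], failed[m])
    print(f"stratum {d} rho(length, failure) = {rho:+.2f}")  # negative within strata
```

By construction, the pooled correlation comes out positive while each within-stratum correlation is negative, mirroring the direction flip the study reports once task difficulty is controlled for.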

Paper Structure

This paper contains 38 sections, 10 figures, 6 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Contrasting agent trajectories on django-15863. Both agents use the SWE-agent framework. Claude 4 Sonnet (left) follows a structured workflow: browse the codebase, reproduce the bug, make one surgical edit, then run several verification cycles before submitting. GPT-4 (right) finds the correct file but then edits the code repeatedly, producing 28 syntax errors, and never runs the project's test suite.
  • Figure 2: Task difficulty distribution across 500 SWE-bench Verified tasks.
  • Figure 3: Representative case study: matplotlib-23476. The agent fixes DPI scaling in display backends (4 files); the gold patch fixes the serialization in __getstate__ (3 lines). Both address the same bug from different architectural levels.
  • Figure 4: Trajectory length by task difficulty for all 19 agents. Every agent takes more steps on harder tasks (all Spearman $\rho < 0$, all $p < 0.01$). This universal length--difficulty coupling explains the confounding reversal: length reflects task difficulty, not strategy quality. (A minimal per-agent correlation check is sketched after this list.)
  • Figure 5: Same task and length (32 steps), different agent, different outcome. Left: CodeSweep/kimi-k2 explores, struggles, breaks through (Vp), submits. Right: SWE-agent/gpt-4 skips exploration, enters an endless P$\to$Ve spiral, never recovers.
  • ...and 5 more figures
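
For the length--difficulty coupling in Figure 4, the per-agent check is straightforward given a flat trajectory table. A minimal sketch, assuming a hypothetical CSV export with columns agent, difficulty, and n_steps (names invented here, not the paper's schema):

```python
# Hypothetical per-agent version of the Figure 4 check: correlate trajectory
# length with task difficulty separately for each of the 19 agents.
# "trajectories.csv" and its column names are assumptions for illustration only.
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("trajectories.csv")  # one row per (agent, task) trajectory

for agent, group in df.groupby("agent"):
    rho, p = spearmanr(group["difficulty"], group["n_steps"])
    print(f"{agent:<30} rho={rho:+.2f}  p={p:.3g}  {'sig' if p < 0.01 else 'n.s.'}")
```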