Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories

Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, He Ye

TL;DR

This paper tackles the problem of understanding code agent behaviour beyond binary success by conducting an empirical trajectory analysis of three state-of-the-art code agents on the SWE-Bench benchmark. It introduces a unified framework to extract and compare execution traces, characterising how agents gather context, localise faults, and navigate failure modes. Key contributions include a trajectory dataset, cross-agent comparisons, structural analysis of failures, and a fault-localisation framework, revealing that success often hinges on approximate rather than exact edits and that longer, more variable trajectories frequently signal failure. The findings highlight the need to move beyond leaderboard metrics toward robust, interpretable autonomous software engineering systems with practical implications for agent design and evaluation.

Abstract

The increasing deployment of Large Language Model (LLM) agents for complex software engineering tasks has created a need to understand their problem-solving behaviours beyond simple success metrics. While these agents demonstrate impressive capabilities in automated issue resolution, their decision-making processes remain largely opaque. This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state-of-the-art code agents (OpenHands, SWE-agent, and Prometheus) on the SWE-Bench benchmark, examining both successful and failed attempts. Our investigation reveals several key insights into agent behaviour. First, we identify how distinct problem-solving strategies, such as defensive programming and context gathering, enable success in different scenarios. Second, we find that failed trajectories are consistently longer and exhibit higher variance than successful ones, with failure patterns differing significantly between agents. Third, our fault localisation analysis shows that while most trajectories correctly identify problematic files (72-81% even in failures), success depends more on achieving approximate rather than exact code modifications. These and other findings from our study provide a foundation for understanding agent behaviour through trajectory analysis, contributing to the development of more robust and interpretable autonomous software engineering systems.
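As a rough illustration of the second and third findings, the sketch below compares step counts between resolved and unresolved trajectories and checks file-level fault localisation. It is not the authors' framework: the record layout (`resolved`, `steps`, `edited_files`, `gold_files`) and both function names are hypothetical, and treating any overlap between edited files and gold-patch files as a file-level hit is an assumption made here for illustration only.

```python
from statistics import mean, pvariance


def length_stats(trajectories):
    """Group trajectory step counts by outcome and summarise each group.

    Each trajectory is a dict with a boolean 'resolved' flag and a list of
    'steps' (hypothetical layout, not the paper's data format).
    """
    groups = {}
    for traj in trajectories:
        groups.setdefault(traj["resolved"], []).append(len(traj["steps"]))
    return {
        outcome: {
            "n": len(counts),
            "mean": mean(counts),
            "variance": pvariance(counts),  # population variance of step counts
        }
        for outcome, counts in groups.items()
    }


def file_localised(edited_files, gold_files):
    """Return True if the agent edited at least one file the gold patch touches.

    Assumption: 'any overlap' counts as a correct file-level localisation.
    """
    return bool(set(edited_files) & set(gold_files))


def file_localisation_rate(trajectories):
    """Fraction of trajectories with a file-level localisation hit."""
    hits = [file_localised(t["edited_files"], t["gold_files"]) for t in trajectories]
    return sum(hits) / len(hits)
```

Applied separately to the failed trajectories of each agent, `file_localisation_rate` would yield the kind of per-agent percentage the abstract reports, under the overlap assumption stated above.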

Paper Structure

This paper contains 21 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Resolved issues in SWE-Bench
  • Figure 2: Trajectory Step Counts in SWE-Bench Lite
  • Figure 3: Trajectory Step Counts in SWE-Bench Verified
  • Figure 4: Proportion of Fault Localisation Combination Outcomes