Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories

Oorja Majgaonkar; Zhiwei Fei; Xiang Li; Federica Sarro; He Ye

TL;DR

This paper tackles the problem of understanding code agent behaviour beyond binary success by conducting an empirical trajectory analysis of three state-of-the-art code agents on the SWE-Bench benchmark. It introduces a unified framework to extract and compare execution traces, characterising how agents gather context, localise faults, and navigate failure modes. Key contributions include a trajectory dataset, cross-agent comparisons, structural analysis of failures, and a fault-localisation framework, revealing that success often hinges on approximate rather than exact edits and that longer, more variable trajectories frequently signal failure. The findings highlight the need to move beyond leaderboard metrics toward robust, interpretable autonomous software engineering systems with practical implications for agent design and evaluation.

Abstract

The increasing deployment of Large Language Model (LLM) agents for complex software engineering tasks has created a need to understand their problem-solving behaviours beyond simple success metrics. While these agents demonstrate impressive capabilities in automated issue resolution, their decision-making processes remain largely opaque. This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state-of-the-art code agents (OpenHands, SWE-agent, and Prometheus) on the SWE-Bench benchmark, examining both successful and failed attempts. Our investigation reveals several key insights into agent behaviour. First, we identify how distinct problem-solving strategies, such as defensive programming and context gathering, enable success in different scenarios. Second, we find that failed trajectories are consistently longer and exhibit higher variance than successful ones, with failure patterns differing significantly between agents. Third, our fault localisation analysis shows that while most trajectories correctly identify the problematic files (72-81% even in failures), success depends more on achieving approximate rather than exact code modifications. These and other findings from our study provide a foundation for understanding agent behaviour through trajectory analysis, contributing to the development of more robust and interpretable autonomous software engineering systems.
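The second and third findings can be made concrete with a small sketch of the two measurements involved: trajectory length statistics split by outcome, and a file-level localisation rate. Everything below is an assumption for illustration. The record fields (`resolved`, `steps`, `edited_files`, `gold_files`), the toy values, and the definition of a localisation "hit" as any overlap between the files an agent edited and the files in the reference patch are hypothetical, and the paper's own framework may define these differently.

```python
from statistics import mean, pvariance

# Hypothetical toy records. Field names and values are illustrative only
# and are NOT taken from the paper's trajectory dataset.
trajectories = [
    {"resolved": True, "steps": 18, "edited_files": {"a.py"}, "gold_files": {"a.py"}},
    {"resolved": True, "steps": 22, "edited_files": {"b.py"}, "gold_files": {"b.py"}},
    {"resolved": True, "steps": 20, "edited_files": {"c.py", "d.py"}, "gold_files": {"c.py"}},
    {"resolved": False, "steps": 30, "edited_files": {"e.py"}, "gold_files": {"e.py"}},
    {"resolved": False, "steps": 60, "edited_files": {"f.py"}, "gold_files": {"f.py"}},
    {"resolved": False, "steps": 45, "edited_files": {"x.py"}, "gold_files": {"g.py"}},
    {"resolved": False, "steps": 25, "edited_files": {"h.py"}, "gold_files": {"h.py"}},
]


def length_stats(trajs, resolved):
    """Mean and population variance of step counts for one outcome group.

    Assumes the group is non-empty; statistics.mean raises on empty input.
    """
    lengths = [t["steps"] for t in trajs if t["resolved"] is resolved]
    return {"n": len(lengths), "mean": mean(lengths), "variance": pvariance(lengths)}


def file_localisation_rate(trajs, resolved):
    """Fraction of trajectories in a group that edited at least one gold file.

    "Hit" here means a non-empty intersection of edited and gold files,
    which is an assumed definition for this sketch.
    """
    group = [t for t in trajs if t["resolved"] is resolved]
    hits = sum(1 for t in group if t["edited_files"] & t["gold_files"])
    return hits / len(group)


success_stats = length_stats(trajectories, resolved=True)
failure_stats = length_stats(trajectories, resolved=False)
failure_loc_rate = file_localisation_rate(trajectories, resolved=False)

print("success:", success_stats)
print("failure:", failure_stats)
print("file-level localisation rate in failures:", failure_loc_rate)
```

On this toy data the failed group is both longer on average and more variable than the successful group, and most failed trajectories still touch a correct file. That mirrors the shape of the reported findings but says nothing about their actual magnitudes.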
