Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios

Zhi Chen, Wei Ma, Lingxiao Jiang

TL;DR

The paper evaluates AI-driven software development agents by analyzing the error-resolving processes they go through while solving real-world GitHub issues, rather than only their final patches. It presents a process-centric study on SWE-Bench Verified data, collecting 3,977 solving-phase trajectories and 3,931 testing-phase logs from eight top-ranked agents across 500 tasks to characterize errors, their prevalence, and their recurrence. The authors find that a single error does not drastically reduce patch success, but high error frequency and persistent (cross-phase) errors substantially hinder performance, with environment- and database-related failures being particularly challenging. They also uncover three SWE-Bench platform bugs that affect fairness and measurement accuracy, and they publicly share their data and scripts to support reproducibility and future research. Overall, the work highlights the need for proactive error avoidance and robust recovery mechanisms to make automated software development agents more reliable and efficient.

Abstract

AI-driven software development has rapidly advanced with the emergence of software development agents that leverage large language models (LLMs) to tackle complex, repository-level software engineering tasks. These agents go beyond just generation of final code; they engage in multi-step reasoning, utilize various tools for code modification and debugging, and interact with execution environments to diagnose and iteratively resolve issues. However, most existing evaluations focus primarily on static analyses of final code outputs, yielding limited insights into the agents' dynamic problem-solving processes. To fill this gap, we conduct an in-depth empirical study on 3,977 solving-phase trajectories and 3,931 testing-phase logs from 8 top-ranked agents evaluated on 500 GitHub issues in the SWE-Bench benchmark. Our exploratory analysis shows that Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues (e.g., IntegrityError) that demand significantly more debugging effort. Furthermore, we have discovered 3 bugs in the SWE-Bench platform that affect benchmark fairness and accuracy; these issues have been reported to and confirmed by the maintainers. To promote transparency and foster future research, we publicly share our datasets and analysis scripts.
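The kind of process-level analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the regular expression, the function names `count_error_types` and `resolution_rate_by_error_count`, and the `(log_text, resolved)` input format are all assumptions made for illustration. The sketch counts Python exception types that appear at the start of a log line and then groups trajectories by their total error count to compute a resolution rate per group.

```python
import re
from collections import Counter

# Assumption: an error is the final line of a Python traceback, i.e. a line that
# starts with an (optionally module-qualified) name ending in "Error" or "Exception".
ERROR_LINE = re.compile(
    r"^[ \t]*(?:[\w.]+\.)?(\w*(?:Error|Exception))\b", re.MULTILINE
)


def count_error_types(log_text):
    """Return a Counter mapping exception class name to number of occurrences."""
    return Counter(ERROR_LINE.findall(log_text))


def resolution_rate_by_error_count(trajectories):
    """Group (log_text, resolved) pairs by total error count.

    Returns a dict mapping error count to the fraction of resolved trajectories.
    """
    buckets = {}  # error count -> (total trajectories, resolved trajectories)
    for log_text, resolved in trajectories:
        n_errors = sum(count_error_types(log_text).values())
        total, solved = buckets.get(n_errors, (0, 0))
        buckets[n_errors] = (total + 1, solved + int(resolved))
    return {n: solved / total for n, (total, solved) in buckets.items()}


# Hypothetical log excerpt, written only for this example.
sample_log = (
    "Traceback (most recent call last):\n"
    '  File "a.py", line 1, in <module>\n'
    "    import numpy\n"
    "ModuleNotFoundError: No module named 'numpy'\n"
    "django.db.utils.IntegrityError: UNIQUE constraint failed\n"
    "TypeError: bad operand\n"
    "TypeError: bad operand again\n"
)

if __name__ == "__main__":
    print(count_error_types(sample_log))
```

A line-anchored pattern like this is deliberately simple: it ignores `raise` statements inside source listings, but it can still miscount prose lines that happen to begin with a word ending in "Error", so a real study would need a stricter traceback parser.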

Paper Structure

This paper contains 44 sections, 2 figures, 12 tables.

Figures (2)

  • Figure 1: Study overview: solving-phase trajectories inform analyses of unexpected-error impact (RQ1), common-error prevalence (RQ2), and challenging-error identification (RQ3); testing-phase logs reveal testing errors and failures (RQ4).
  • Figure 2: Resolution Rate by Error Frequency