An Empirical Study on LLM-based Agents for Automated Bug Fixing
Xiangxin Meng, Zexiong Ma, Pengfei Gao, Chao Peng
TL;DR
The paper investigates six top LLM-based bug-fixing agents on the SWE-bench Verified dataset, systematically comparing their effectiveness, fault localization granularity, and bug reproduction capabilities. It introduces RepoFixer to study reproduction quality and demonstrates that while performance varies across systems, symbol-level localization and robust reproduction scripts are strongly linked to repair success. The study reveals significant room for improvement in LLM reasoning and agent design, and it highlights practical guidance for building more reliable, generalizable bug-fixing agents. The results have practical implications for automatic bug fixing in real-world software maintenance, underscoring the value of high-quality issue descriptions, precise localization, and rigorous reproduction verification.
Abstract
Large language models (LLMs) and LLM-based agents have been applied to fix bugs automatically, demonstrating the ability to address software defects through development environment interaction, iterative validation, and code modification. However, systematic analysis of these agent systems remains limited, particularly regarding performance variations among the top-performing ones. In this paper, we examine six repair systems on the SWE-bench Verified benchmark for automated bug fixing. We first assess each system's overall performance, noting the instances solvable by all or none of these systems, and explore the capabilities of the different systems. We also compare fault localization accuracy at the file and code symbol levels and evaluate bug reproduction capabilities. Through this analysis, we conclude that further optimization is needed in both the capability of the LLM itself and the design of the agentic flow to improve the effectiveness of agents in bug fixing.
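To make the two comparisons in the abstract concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each system's results are available as a set of resolved instance IDs, and it assumes a localization "hit" means the generated patch touches at least one location (file path or symbol name) that the reference patch also modifies. The function names, system names, instance IDs, and paths are all hypothetical.

```python
from typing import Dict, Set, Tuple


def solved_by_all_and_none(
    resolved: Dict[str, Set[str]], all_instances: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (instances every system resolved, instances no system resolved)."""
    per_system = list(resolved.values())
    if not per_system:
        return set(), set(all_instances)
    by_all = set.intersection(*per_system)
    by_any = set.union(*per_system)
    return by_all, all_instances - by_any


def localization_hit(patch_locations: Set[str], gold_locations: Set[str]) -> bool:
    """Assumed criterion: the patch edits at least one location the gold patch edits.

    The same check works at file level (paths) or symbol level (qualified names).
    """
    return bool(patch_locations & gold_locations)


# Toy data, invented purely for illustration.
resolved = {
    "agent_a": {"inst-1", "inst-2"},
    "agent_b": {"inst-1", "inst-3"},
}
all_instances = {"inst-1", "inst-2", "inst-3", "inst-4"}

by_all, by_none = solved_by_all_and_none(resolved, all_instances)

# File-level hit but symbol-level miss: right file, wrong function.
file_hit = localization_hit({"pkg/a.py"}, {"pkg/a.py"})
symbol_hit = localization_hit({"pkg/a.py::foo"}, {"pkg/a.py::bar"})
```

In this toy case a patch can land in the correct file yet edit the wrong symbol, which is why the two granularities are worth reporting separately.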
