COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis
Weiqing Yang, Hanbin Wang, Zhenghao Liu, Xinze Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Zhiyuan Liu, Ge Yu
TL;DR
This work argues that evaluating LLM debugging requires assessing multiple stages beyond Code Repair. It introduces DebugEval, a four-task benchmark spanning Bug Localization, Bug Identification, Code Repair, and Code Recognition across Python, C++, and Java, with bug data drawn from both humans and GPT-4 to reflect real-world debugging. To improve LLM debugging capabilities, it proposes COAST, a three-agent data synthesis framework that generates high-quality SFT data, along with NeuDebugger, a fine-tuned model. Empirical results show that COAST-augmented training closes much of the gap between 7B-scale LLMs and GPT-3.5, and reveal nuanced effects of Chain-of-Thought on different tasks, underscoring the importance of data quality for targeted debugging improvements.
Abstract
Code debugging is a vital stage of software development, essential for ensuring the reliability and performance of Large Language Models (LLMs) in the code generation task. Human debugging typically follows a multi-stage process, which includes Bug Localization, Bug Identification, Code Repair, and Code Recognition. However, existing code debugging benchmarks predominantly focus on the Code Repair stage, which offers only a limited view of the debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a comprehensive benchmark for evaluating the debugging abilities of LLMs by emulating the multi-stage human debugging process. Evaluating on DEBUGEVAL, we observe that 7B-scale models consistently underperform their larger counterparts, highlighting their limitations in comprehending code semantics. To address this, we propose the COmmunicative Agent-based data SynThesis (COAST) framework, which employs a multi-agent system to generate high-quality training data for supervised fine-tuning (SFT). Experimental results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance comparable to GPT-3.5. All data and code are available at https://github.com/NEUIR/COAST.
