COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis

Weiqing Yang; Hanbin Wang; Zhenghao Liu; Xinze Li; Yukun Yan; Shuo Wang; Yu Gu; Minghe Yu; Zhiyuan Liu; Ge Yu

TL;DR

This work argues that evaluating LLM debugging requires assessing multiple stages beyond Code Repair. It introduces DebugEval, a four-task benchmark spanning Bug Localization, Bug Identification, Code Repair, and Code Recognition across Python, C++, and Java, with bugs written by humans and generated by GPT-4 to reflect real-world debugging. To improve LLM debugging, it proposes COAST, a three-agent data synthesis framework that generates high-quality SFT data, and NeuDebugger, a model fine-tuned on that data. Empirical results show that COAST-augmented training closes much of the gap between 7B-scale LLMs and GPT-3.5 and that Chain-of-Thought affects the tasks differently, underscoring the importance of data quality for targeted debugging improvements.

Abstract

Code debugging is a vital stage of software development, essential for ensuring the reliability and performance of Large Language Models (LLMs) in the code generation task. Human debugging typically follows a multi-stage process, which includes Bug Localization, Bug Identification, Code Repair, and Code Recognition. However, existing code debugging benchmarks predominantly focus on the Code Repair stage, which offers only a limited perspective on the debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a comprehensive benchmark for evaluating the debugging abilities of LLMs by emulating the multi-stage human debugging process. Evaluating on DEBUGEVAL, we observe that 7B-scale models consistently underperform their larger counterparts, highlighting their limitations in comprehending code semantics. To address this, we propose the COmmunicative Agent-based data SynThesis (COAST) framework, which employs a multi-agent system to generate high-quality training data for supervised fine-tuning (SFT). Experimental results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance comparable to GPT-3.5. All data and code are available at https://github.com/NEUIR/COAST.
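
The abstract does not spell out how the agents in COAST interact, so the following is only a minimal sketch under stated assumptions. It uses the three agent roles named in Figure 5 (Code Quizzer, Code Learner, Code Teacher) and assumes one plausible protocol: the quizzer poses a debugging problem, the learner attempts it, and the teacher writes an explanation only for problems the learner gets wrong. The names `Problem`, `SFTExample`, `synthesize`, and `demo`, as well as the filtering rule, are hypothetical and not taken from the paper or its repository; the agents are plain callables standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Problem:
    task: str    # e.g. "Bug Identification" (one of the four DebugEval tasks)
    prompt: str  # question shown to the model
    answer: str  # reference answer used to check the learner


@dataclass
class SFTExample:
    prompt: str
    response: str


def synthesize(
    quizzer: Callable[[], Problem],
    learner: Callable[[Problem], str],
    teacher: Callable[[Problem], Optional[str]],
    rounds: int,
) -> List[SFTExample]:
    """Collect SFT examples from a quizzer/learner/teacher loop (assumed protocol)."""
    data: List[SFTExample] = []
    for _ in range(rounds):
        problem = quizzer()
        # Assumption: problems the learner already solves are skipped.
        if learner(problem).strip() == problem.answer:
            continue
        explanation = teacher(problem)
        # Assumption: the teacher may decline (None), in which case nothing is kept.
        if explanation is not None:
            data.append(SFTExample(problem.prompt, explanation))
    return data


def demo() -> List[SFTExample]:
    """Run the loop with deterministic stub agents instead of real LLMs."""
    problems = iter([
        Problem("Bug Identification", "Error type? (A) Syntax (B) Logic", "B"),
        Problem("Bug Identification", "Error type? (A) Syntax (B) Runtime", "A"),
    ])
    quizzer = lambda: next(problems)
    learner = lambda p: "A"  # stub learner that always answers "A"
    teacher = lambda p: f"Step-by-step reasoning. Answer: {p.answer}"
    return synthesize(quizzer, learner, teacher, rounds=2)
```

With the stub agents in `demo`, only the first problem is kept, because the stub learner answers the second one correctly. In a real pipeline each callable would wrap a prompted model, and the answer check would depend on the task format.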

Paper Structure (24 sections, 9 figures, 6 tables)

Introduction
Related Work
DebugEval: Benchmarking the Debugging Capabilities of LLMs
Task Definition
Details of Data Construction
Comparison of Different Debugging Benchmarks
COAST: Communicative Agent Based Data Synthesis Framework
Agent Building
Synthesizing SFT Data through Multi-Agent Interactions
Experimental Methodology
Evaluation Results
Overall Performance
Ablation Studies
Effectiveness of NeuDebugger on Different Bug Types
Conclusion
...and 9 more sections

Figures (9)

Figure 1: Illustration of the Human Code Debugging Process. Existing studies (Olausson et al., 2023; Chen et al., 2023) typically focus on directly repairing code generated by Large Language Models (LLMs). In contrast, humans often engage in a multi-stage process to resolve buggy code.
Figure 2: Illustration of the DebugEval Benchmark. DebugEval includes four key tasks: Bug Localization, Bug Identification, Code Repair, and Code Recognition.
Figure 3: Illustration of COmmunicative Agent Based Data SynThesis (COAST) Framework.
Figure 4: Response Distributions of DSCoder-6.7B-Ins and NeuDebugger-DS-6.7B in BUG Identification Task.
Figure 5: Illustrations of Prompts Used in COAST to Configure Different Agents. Within COAST, there are three LLM-based agents, including Code Quizzer, Code Learner, and Code Teacher. We utilize specific instructions to ensure they play the correct roles and carry out the intended tasks.
...and 4 more figures
