Table of Contents
Fetching ...

COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis

Weiqing Yang, Hanbin Wang, Zhenghao Liu, Xinze Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Zhiyuan Liu, Ge Yu

TL;DR

This work argues that evaluating LLM debugging requires assessing multiple stages beyond Code Repair. It introduces DebugEval, a four-task benchmark spanning Bug Localization, Bug Identification, Code Repair, and Code Recognition across Python, C++, and Java, including bug data from humans and GPT-4 to reflect real-world debugging. To improve LLM debugging capabilities, it proposes COAST, a three-agent data synthesis framework that generates high-quality SFT data and a NeuDebugger fine-tuned model. Empirical results show COAST-augmented training closes much of the gap for 7B-scale LLMs relative to GPT-3.5 and reveals nuanced effects of Chain-of-Thought on different tasks, underscoring the importance of data quality in targeted debugging improvements.

Abstract

Code debugging is a vital stage of software development, essential for ensuring the reliability and performance of Large Language Models (LLMs) in the code generation task. Human debugging typically follows a multi-stage process, which includes Bug Localization, Bug Identification, Code Repair, and Code Recognition. However, existing code debugging benchmarks predominantly focus on the Code Repair stage, which offers only a limited perspective on evaluating the debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a comprehensive benchmark for evaluating the debugging abilities of LLMs by emulating the multi-stage human debugging process. Through evaluating on DEBUGEVAL, we observe that 7B-scale models consistently underperform compared to their larger counterparts, highlighting their limitations in comprehending code semantics. In this case, we propose the COmmunicative Agent-based data SynThesis (COAST) framework, which employs a multi-agent system to generate high-quality training data for supervised fine-tuning (SFT). Experimental results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance comparable to GPT-3.5. All data and codes are available at https://github.com/NEUIR/COAST.

COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis

TL;DR

This work argues that evaluating LLM debugging requires assessing multiple stages beyond Code Repair. It introduces DebugEval, a four-task benchmark spanning Bug Localization, Bug Identification, Code Repair, and Code Recognition across Python, C++, and Java, including bug data from humans and GPT-4 to reflect real-world debugging. To improve LLM debugging capabilities, it proposes COAST, a three-agent data synthesis framework that generates high-quality SFT data and a NeuDebugger fine-tuned model. Empirical results show COAST-augmented training closes much of the gap for 7B-scale LLMs relative to GPT-3.5 and reveals nuanced effects of Chain-of-Thought on different tasks, underscoring the importance of data quality in targeted debugging improvements.

Abstract

Code debugging is a vital stage of software development, essential for ensuring the reliability and performance of Large Language Models (LLMs) in the code generation task. Human debugging typically follows a multi-stage process, which includes Bug Localization, Bug Identification, Code Repair, and Code Recognition. However, existing code debugging benchmarks predominantly focus on the Code Repair stage, which offers only a limited perspective on evaluating the debugging capabilities of LLMs. In this paper, we introduce DEBUGEVAL, a comprehensive benchmark for evaluating the debugging abilities of LLMs by emulating the multi-stage human debugging process. Through evaluating on DEBUGEVAL, we observe that 7B-scale models consistently underperform compared to their larger counterparts, highlighting their limitations in comprehending code semantics. In this case, we propose the COmmunicative Agent-based data SynThesis (COAST) framework, which employs a multi-agent system to generate high-quality training data for supervised fine-tuning (SFT). Experimental results demonstrate that COAST-generated data outperform human-curated and GPT-4-generated data, enabling 7B-scale LLMs to achieve debugging performance comparable to GPT-3.5. All data and codes are available at https://github.com/NEUIR/COAST.
Paper Structure (24 sections, 9 figures, 6 tables)

This paper contains 24 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Illustration of the Human Code Debugging Process. Existing studies olausson2023selfchen2023teaching typically focus on directly repairing code generated by Large Language Models (LLMs). In contrast to this approach, humans often engage in a multi-stage process to resolve buggy codes.
  • Figure 2: Illustration of DebugEval Benchmark. The DebugEval includes four key tasks: BUG Localization, BUG Identification, Code Repair, and Code Recognition.
  • Figure 3: Illustration of COmmunicative Agent Based Data SynThesis (COAST) Framework.
  • Figure 4: Response Distributions of DSCoder-6.7B-Ins and NeuDebugger-DS-6.7B in BUG Identification Task.
  • Figure 5: Illustrations of Prompts Used in COAST to Configure Different Agents. Within COAST, there are three LLM-based agents, including Code Quizzer, Code Learner, and Code Teacher. We utilize specific instructions to ensure they play the correct roles and carry out the intended tasks.
  • ...and 4 more figures