From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Yuling Shi; Songsong Wang; Chengcheng Wan; Min Wang; Xiaodong Gu

TL;DR

Code generated by LLMs often contains subtle bugs that cause it to fail tests. The authors introduce MGDebugger, a bottom-up hierarchical debugger that decomposes code into a tree of subfunctions, generates subfunction-specific tests, and uses an LLM-simulated executor to trace execution and fix bugs. Across multiple benchmarks and backbone models, MGDebugger yields substantial gains in accuracy and repair success rate, and it generalizes to real-world defects in Defects4J, outperforming existing debugging and repair methods. The approach offers a more reliable, scalable route to automatic code repair, with potential applications in AI-assisted software development and self-improving code-generation systems.

Abstract

While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked by subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing LLM-based debugging systems treat generated programs as monolithic units, failing to address bugs at multiple levels of granularity, from low-level syntax errors to high-level algorithmic flaws. In this paper, we introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger that isolates, identifies, and resolves bugs at various levels of granularity. MGDebugger decomposes problematic code into a hierarchical tree structure of subfunctions, with each level representing a particular granularity of error. During debugging, it analyzes each subfunction and iteratively resolves bugs in a bottom-up manner. To effectively test each subfunction, we propose an LLM-simulated Python executor, which traces code execution and tracks important variable states to pinpoint errors accurately. Extensive experiments demonstrate that MGDebugger outperforms existing debugging systems, achieving an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6% repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes bugs across different categories and difficulty levels, demonstrating its robustness and effectiveness.
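
The bottom-up order described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `SubFunction` class, the `debug_bottom_up` helper, and the `toy_fix` callback are hypothetical names introduced here, and `toy_fix` is a trivial string replacement standing in for the LLM-based test generation and simulated execution. The sketch shows only the traversal: every child subfunction is debugged before the function that calls it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SubFunction:
    """One node in a (hypothetical) tree of decomposed subfunctions."""
    name: str
    source: str
    children: List["SubFunction"] = field(default_factory=list)


def debug_bottom_up(
    node: SubFunction,
    debug_one: Callable[[SubFunction], str],
    order: Optional[List[str]] = None,
) -> List[str]:
    """Post-order traversal: debug all children, then the node itself.

    `debug_one` returns the (possibly repaired) source for a single node.
    The returned list records the order in which nodes were debugged.
    """
    if order is None:
        order = []
    for child in node.children:
        debug_bottom_up(child, debug_one, order)
    node.source = debug_one(node)
    order.append(node.name)
    return order


def toy_fix(node: SubFunction) -> str:
    # Assumption: a placeholder for the real debugging step, which would
    # generate tests for the subfunction and ask an LLM to trace and fix it.
    return node.source.replace("BUG", "FIXED")


# A small illustrative tree: main -> [parse, compute -> [helper]]
tree = SubFunction(
    "main",
    "BUG main",
    [
        SubFunction("parse", "ok parse"),
        SubFunction("compute", "BUG compute", [SubFunction("helper", "BUG helper")]),
    ],
)

visited = debug_bottom_up(tree, toy_fix)
print(visited)
```

Because the traversal is post-order, leaf subfunctions such as `helper` are repaired before `compute`, and `main` is handled last, so each level is debugged on top of already-repaired building blocks.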

Paper Structure (24 sections, 6 figures, 6 tables, 1 algorithm)

Introduction
Methodology
Overview
Hierarchical Code Decomposition
Generating Test Cases for Subfunctions
Debugging Subfunctions with LLM-simulated Execution
Bottom-up Debugging
Experiments
Setup
Ablation Study (RQ2)
Debugging Different Types of Bugs (RQ3)
Performance Consistency (RQ4)
Generalization to Real-World Software Defects (RQ5)
Discussion
Case Study
...and 9 more sections

Figures (6)

Figure 1: Workflow of MGDebugger compared to existing methods.
Figure 2: Illustration of the subfunction debugging process in MGDebugger.
Figure 3: Repair success rate of different methods when debugging code of different lengths.
Figure 4: Impact of debug attempts on the cumulative repair success rate of different methods.
Figure 5: Bug fix Venn diagram in Defects4J V1.2.
...and 1 more figure
