Effective Large Language Model Debugging with Best-first Tree Search

Jialin Song; Jonathan Raiman; Bryan Catanzaro

TL;DR

The paper addresses the challenge of debugging code generated by large language models by introducing BESTER, a best-first tree search framework that interleaves program generation, execution feedback, and self-reflection-based repairs. By sampling multiple self-reflections and selecting the best repair at each step, BESTER achieves state-of-the-art Pass@1 on HumanEval, MBPP, and APPS across multiple models, and it keeps its advantage when pass rates account for the extra inference cost of tree search (Pass@Infer). The authors provide an interpretability study showing that self-reflections focus on specific buggy lines and drive targeted edits, along with ablations that reveal how design choices (depth, breadth, and selection rules) influence performance. The work highlights practical implications for iterative code synthesis with LLMs and suggests future directions: scaling to larger real-world codebases and integrating BESTER into broader reasoning and agent-based frameworks.

Abstract

Large Language Models (LLMs) show promise in code generation tasks. However, their code-writing abilities are often limited in scope: while they can successfully implement simple functions, they struggle with more complex tasks. A fundamental difference between how an LLM writes code and how a human programmer does is that the LLM cannot consistently spot and fix bugs. Debugging is a crucial skill for programmers, and it enables iterative code refinement towards a correct implementation. In this work, we propose a novel algorithm that enables LLMs to debug their code via self-reflection and search, where a model attempts to identify its previous mistakes. Our key contributions are 1) a best-first tree search algorithm with self-reflections (BESTER) that achieves state-of-the-art Pass@1 on three code generation benchmarks and maintains its superiority when pass rates take into account the additional inference costs incurred by tree search; 2) a novel interpretability study on what self-reflections attend to in buggy programs and how they impact bug fixes, which provides a deeper understanding of the debugging process; and 3) an extensive study on when self-reflections are effective in finding bugs.

Paper Structure (31 sections, 7 figures, 7 tables, 1 algorithm)

This paper contains 31 sections, 7 figures, 7 tables, 1 algorithm.

Introduction
Related Works
LLMs for Code Generation
Logical Reasoning in LLMs
Methodology
Program Generation
Execution Evaluation
Self-reflection Generation
Best Self-reflection Tree Search (BESTER)
Experiments
Experiment Setup
Models
Datasets
Metrics
Baselines
...and 16 more sections

Figures (7)

Figure 1: BESTER interleaves 1) program generation, 2) execution evaluation, and 3) self-reflection generation to refine programs. We sample multiple self-reflections based on a buggy program. Then an LLM generates a program repair based on each self-reflection. We select the repair with the highest score as the Best Repair and call its corresponding self-reflection the Best Self-reflection (highlighted in blue). If the best repair still fails some tests, we repeat self-reflection generations and program repairs until reaching a pre-determined maximal depth.
Figure 2: Pass@Infer results with different tree search depth and number of self-reflection configurations. Empirical results suggest choosing a depth of 2 and sampling 5 self-reflections at each step.
Figure 3: An example self-reflection and repair program from BESTER with Deepseek. We highlight the diff line in red.
Figure 4: Mean normalized rank distributions for attribution scores that measure how much self-reflections depend on buggy programs. Lower values represent higher attribution scores. Lines that will be changed in the edited program have more influence on the self-reflection than lines that will remain the same.
Figure 5: Mean normalized rank distributions for attribution scores that measure how much repair programs depend on self-reflections. Lower values represent higher attribution scores. Self-reflections cause targeted code diff edits.
...and 2 more figures
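
The loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `bester_sketch` and the four callables (`generate`, `reflect`, `repair`, `evaluate`) are hypothetical stand-ins for LLM calls and test execution, and the score is assumed to be a number in [0, 1] such as the fraction of tests passed, which the captions do not specify. The defaults of depth 2 and 5 self-reflections follow the configuration suggested in the Figure 2 caption.

```python
from typing import Callable, Tuple

# Hypothetical interfaces; none of these names come from the paper.
Generate = Callable[[str], str]                # task prompt -> program
Reflect = Callable[[str, str, str], str]       # (prompt, program, feedback) -> self-reflection
Repair = Callable[[str, str, str], str]        # (prompt, program, self-reflection) -> repaired program
Evaluate = Callable[[str], Tuple[float, str]]  # program -> (score in [0, 1], execution feedback)


def bester_sketch(
    prompt: str,
    generate: Generate,
    reflect: Reflect,
    repair: Repair,
    evaluate: Evaluate,
    max_depth: int = 2,
    num_reflections: int = 5,
) -> Tuple[str, float]:
    """Return the final program and its score after best-first repair."""
    # 1) Program generation, then 2) execution evaluation.
    program = generate(prompt)
    score, feedback = evaluate(program)

    for _ in range(max_depth):
        if score >= 1.0:  # assumed convention: 1.0 means all tests pass
            break

        # 3) Sample several self-reflections on the buggy program and
        #    generate one repair per self-reflection.
        candidates = []
        for _ in range(num_reflections):
            reflection = reflect(prompt, program, feedback)
            repaired = repair(prompt, program, reflection)
            repaired_score, repaired_feedback = evaluate(repaired)
            candidates.append((repaired_score, repaired, repaired_feedback))

        # Keep only the highest-scoring repair (the "Best Repair") and
        # continue the search from it at the next depth.
        score, program, feedback = max(candidates, key=lambda c: c[0])

    return program, score
```

In this sketch the search always continues from the best repair at each depth, even if it scores no higher than its parent; whether the original method handles that case differently is not stated in the text above.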
