TGPR: Tree-Guided Policy Refinement for Robust Self-Debugging of LLMs
Daria Ozerova, Ekaterina Trofimova
TL;DR
Iterative refinement is hampered by the difficulty of exploring the large space of candidate code repairs. TGPR uses a Thompson Sampling-guided tree search as a training-time data augmentation engine that guides exploration for GRPO, with a hybrid reward $R(\rho)$ that combines CodeBLEU with functional success and a Beta-distribution-based tree selection strategy. The method yields substantial improvements over a strong GRPO baseline across MBPP, HumanEval, and APPS, with absolute gains of up to 4.2 percentage points in pass@1 and up to 12.51 percentage points in pass@10. This work provides a principled framework for combining learned policies with structured search to make LLM self-debugging more robust, and demonstrates its practical value for code generation tasks.
Abstract
Iterative refinement is a promising paradigm for enabling large language models (LLMs) to solve difficult reasoning and problem-solving tasks. A key challenge, however, is how to search the enormous space of possible refinements effectively. Existing methods typically rely on predefined heuristics, which handle the exploration-exploitation trade-off poorly and cannot adapt to past refinement outcomes. We introduce Tree-Guided Policy Refinement (TGPR), a novel framework that combines GRPO with a Thompson Sampling-based tree search. TGPR actively explores both failed and successful refinement paths, yielding denser training trajectories and more adaptive policies. On the HumanEval, MBPP, and APPS benchmarks, our method achieves up to +4.2 percentage points absolute improvement in pass@1 (on MBPP) and up to +12.51 percentage points absolute improvement in pass@10 (on APPS) compared to a competitive GRPO baseline. Beyond code debugging, TGPR offers a principled approach to combining learned policies with structured search methods, providing a general framework for enhancing iterative refinement and stateful reasoning in LLMs.
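
The selection mechanism named above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` structure, the convex mixing weight in `hybrid_reward`, and the fractional Beta update are all assumptions made for illustration, and the similarity and pass-rate inputs are plain numbers standing in for a CodeBLEU score and a test pass rate.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    """One candidate refinement in the search tree (hypothetical structure)."""
    code: str
    alpha: float = 1.0  # Beta pseudo-count for good outcomes (uniform prior)
    beta: float = 1.0   # Beta pseudo-count for bad outcomes (uniform prior)


def hybrid_reward(similarity: float, pass_rate: float, weight: float = 0.3) -> float:
    """Assumed convex mix of a CodeBLEU-like score and a test pass rate, both in [0, 1]."""
    return weight * similarity + (1.0 - weight) * pass_rate


def select_node(nodes, rng):
    """Thompson Sampling: draw once from each node's Beta posterior, keep the largest draw."""
    return max(nodes, key=lambda n: rng.betavariate(n.alpha, n.beta))


def update(node, reward):
    """Treat a reward in [0, 1] as a fractional success (an assumed update rule)."""
    node.alpha += reward
    node.beta += 1.0 - reward


# Usage: one branch keeps passing tests, the other keeps failing.
rng = random.Random(0)
good, bad = Node("fix_a"), Node("fix_b")
for _ in range(20):
    update(good, hybrid_reward(similarity=0.8, pass_rate=1.0))
    update(bad, hybrid_reward(similarity=0.4, pass_rate=0.0))

picks = [select_node([good, bad], rng) for _ in range(200)]
good_share = sum(p is good for p in picks) / len(picks)
```

Because each choice is a random draw from a posterior and not a fixed argmax, a branch with few observations keeps a wide distribution and is still visited occasionally, while a branch with consistently high reward is selected most of the time.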
