Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

Hao Tang; Keya Hu; Jin Peng Zhou; Sicheng Zhong; Wei-Long Zheng; Xujie Si; Kevin Ellis

TL;DR

Refinement exposes an explore-exploit tradeoff: exploit by refining the program that passes the most test cases, or explore by refining a less-considered program.

Abstract

Iteratively improving and repairing source code with large language models (LLMs), known as refinement, has emerged as a popular way of generating programs that would be too complex to construct in one shot. Given a bank of test cases, together with a candidate program, an LLM can improve that program by being prompted with failed test cases. But it remains an open question how to best iteratively refine code, with prior work employing simple greedy or breadth-first strategies. We show here that refinement exposes an explore-exploit tradeoff: exploit by refining the program that passes the most test cases, or explore by refining a lesser considered program. We frame this as an arm-acquiring bandit problem, which we solve with Thompson Sampling. The resulting LLM-based program synthesis algorithm is broadly applicable: Across loop invariant synthesis, visual reasoning puzzles, and competition programming problems, we find that our new method can solve more problems using fewer language model calls.
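The arm-acquiring bandit framing in the abstract can be sketched as a short search loop. This is a minimal illustration, not the authors' implementation: the names `rex_search`, `refine`, `score`, the toy string domain, and the default `C=20.0` are all hypothetical, and the Beta posterior `Beta(1 + C*h, 1 + C*(1 - h) + N)` is an assumed parameterization that this page does not state. Each program is an arm, Thompson Sampling picks which one to refine, and every refinement adds a new arm.

```python
import random
from dataclasses import dataclass


@dataclass
class Arm:
    program: str        # candidate program text
    h: float            # heuristic in [0, 1], e.g. fraction of test cases passed
    n_refined: int = 0  # number of times this program has been refined so far


def rex_search(seed, refine, score, budget, C=20.0, rng=None):
    """Thompson-sampling refinement loop (illustrative sketch only).

    refine(program) -> new program; stands in for one LLM call (hypothetical).
    score(program)  -> float in [0, 1]; 1.0 means every test passes (hypothetical).
    Returns (solution or None, number of refine calls used).
    """
    rng = rng or random.Random(0)
    arms = [Arm(seed, score(seed))]
    if arms[0].h >= 1.0:
        return seed, 0
    for calls in range(1, budget + 1):
        # Draw one sample per arm from its ASSUMED posterior
        # Beta(1 + C*h, 1 + C*(1 - h) + N) and refine the arm with the largest draw.
        draws = [
            rng.betavariate(1 + C * a.h, 1 + C * (1 - a.h) + a.n_refined)
            for a in arms
        ]
        chosen = arms[max(range(len(arms)), key=draws.__getitem__)]
        chosen.n_refined += 1
        child = refine(chosen.program)
        h = score(child)
        if h >= 1.0:
            return child, calls
        arms.append(Arm(child, h))  # arm-acquiring: each refinement adds an arm
    return None, budget


# Toy stand-ins so the loop can run without an LLM (purely illustrative).
TARGET = "aaaa"


def toy_score(program):
    """Fraction of positions that match the target string."""
    return sum(x == y for x, y in zip(program, TARGET)) / len(TARGET)


def toy_refine(program):
    """Fix the first wrong character, mimicking a helpful refinement."""
    for i, (x, y) in enumerate(zip(program, TARGET)):
        if x != y:
            return program[:i] + y + program[i + 1:]
    return program


if __name__ == "__main__":
    print(rex_search("bbbb", toy_refine, toy_score, budget=50))
```

Sampling from the posterior, rather than always taking the highest-scoring program, is what lets the loop occasionally return to a program that has been refined fewer times.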

Paper Structure (46 sections, 10 equations, 17 figures, 4 tables, 1 algorithm)

Introduction
Background: Bandits and Thompson Sampling
Problem Statement and Assumptions
Definitions: Specification, $\mathbf{(\vdash)}$.
Definition: Counterexamples.
Refinement.
Heuristic measures of progress.
REx: Refine, Explore, Exploit
Understanding the behavior of REx.
Experimental Results
Problems solved as a function of compute budget.
Solving hard problems.
Hyperparameter sensitivity.
Related Work
Code refinement.
...and 31 more sections

Figures (17)

Figure 1: Left: The tree of possible refinements is infinitely deep and has infinite branching factor. Each node is a program and each edge is an LLM sample. Right: Explore-Exploit tradeoff for a search state after performing 3 node expansions. Exploit by sampling another child of a program that is nearly correct, or Explore by sampling a child of a program that has been expanded fewer times.
Figure 2: How the model's beliefs about the benefit of refining a program, $\theta$, change as we vary (1) $N$, the number of times it was previously refined, and (2) $h$, the heuristic estimate of how close we are to satisfying the specification (larger $h$ is better). Left: Expected benefit of refining decreases the more we refine, and asymptotically decays to zero (Eq. \ref{eq:expecteddecay}). Middle/Right: Posterior beliefs initially center around $h$ and shift toward zero with each additional refinement. Same colored curves show same values of $h$ for different values of $N$. The hyperparameter $C$ modulates the rate of decay with each additional refinement, and also affects the initial concentration of the density around $h$.
Figure 3: Evaluation domains. For visual reasoning, the goal is to synthesize an image-manipulating program that translates input images to the correct outputs. For software verification, the goal is to synthesize logical conditions as a valid loop invariant, in order to formally verify the functionality of the code. For competition programming, the goal is to generate an algorithm in Python.
Figure 4: Comparing REx with alternatives using GPT-4 (temp=1). BFS and FW are Breadth First Search and Fixed Width, respectively. AUC denotes Area Under the Curve, and Final denotes the success rate at the maximum number of LLM calls (64 for ARC and 300 for the others, following domain conventions). Dark lines show performance with the best hyperparameter setting for each method. Light lines show each run for each hyperparameter setting. The inset box plots show the distribution as the hyperparameters vary. APPS baselines: Parsel \cite{zelikman2022parsel}, AlphaCode \cite{li2022competition}, and \cite{olausson2023selfrepair}. Nonlinear Loop Invariant baselines: Z3/GSpacer \cite{10.1007/978-3-030-53291-8_7} and \cite{yao:pldi20}. ARC baseline: Hypo. Search \cite{wang2023hypothesis}. More results on APPS Interview-Level and ARC are in Figure \ref{fig:supplement_arc_results} and Figure \ref{fig:apps-interview}.
Figure 5: Comparing REx with alternatives with other LLMs on competition programming (APPS Competition-Level). More results on ARC are available in the Appendix, in Figure \ref{fig:curves-and-auc-arc}.
...and 12 more figures
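
The decay described in the Figure 2 caption can be made concrete under one assumption: that the belief about $\theta$ for a program with heuristic value $h$ and $N$ previous refinements is $\mathrm{Beta}(1 + Ch,\ 1 + C(1 - h) + N)$. This parameterization is not stated on this page; it is used here only because it matches the caption's description. Since the mean of $\mathrm{Beta}(\alpha, \beta)$ is $\alpha / (\alpha + \beta)$, the expected benefit would then be:

```latex
\mathbb{E}[\theta \mid h, N]
  = \frac{1 + C h}{(1 + C h) + \bigl(1 + C(1 - h) + N\bigr)}
  = \frac{1 + C h}{2 + C + N}
  \;\xrightarrow{\,N \to \infty\,}\; 0 .
```

Under this assumption, with $C = 20$ and $h = 0.5$ the mean is $11/22 = 0.5$ at $N = 0$ and $11/32 \approx 0.34$ at $N = 10$. A larger $C$ makes the denominator less sensitive to $N$, so the decay is slower, which agrees with the caption's remark that $C$ modulates the rate of decay.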
