AuPair: Golden Example Pairs for Code Repair

Aditi Mavalankar, Hassan Mansoor, Zita Marinho, Masha Samsikova, Tom Schaul

TL;DR

The paper improves code repair under a limited inference-time compute budget with AuPair, an in-context self-repair algorithm. AuPair constructs an ordered set of golden example pairs, each consisting of an initial guess and its fix, and uses each pair as a 1-shot prompt for a separate LLM call. It comprises two phases: Phase 1 collects a large pool of candidate repair pairs by iteratively repairing guesses, and Phase 2 extracts an ordered list of AuPairs via a submodular greedy selection that maximizes complementarity and problem coverage under a budget of $N$ LLM calls. Empirically, AuPair consistently outperforms best-of-$N$ and self-repair across 5 models and 7 datasets, scales strongly with compute, and generalizes to out-of-distribution datasets and cross-model settings while preserving diverse problem coverage. The approach reduces the compute needed to obtain high-quality repaired code and offers a practical, scalable form of self-repair that could extend to tasks beyond coding.
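The Phase 2 selection can be illustrated with a small sketch. It assumes a fix-quality matrix in which entry `[i][j]` is the score obtained on validation problem `j` when candidate pair `i` is the 1-shot example, and it uses one plausible greedy rule: pick the pair with the highest mean remaining score, subtract its row from every row, and clip at zero so that later picks are rewarded only for problems not yet covered. The names `select_aupairs`, `fix_quality`, and `tol` are hypothetical, and the paper's own algorithm may differ in its details.

```python
from typing import List, Sequence


def select_aupairs(
    fix_quality: Sequence[Sequence[float]],
    budget: int,
    tol: float = 1e-9,
) -> List[int]:
    """Greedily pick up to `budget` row indices, in order of selection.

    Assumption: fix_quality[i][j] is the score on validation problem j
    when candidate pair i is used as the single in-context example.
    """
    if not fix_quality or not fix_quality[0]:
        return []
    num_problems = len(fix_quality[0])
    # Work on a copy so the caller's matrix is left untouched.
    residual = [list(row) for row in fix_quality]
    selected: List[int] = []
    while len(selected) < budget:
        # Pair with the largest total (equivalently, mean) remaining score.
        best = max(range(len(residual)), key=lambda i: sum(residual[i]))
        gain = sum(residual[best]) / num_problems
        if gain <= tol:
            break  # no remaining pair adds coverage
        selected.append(best)
        covered = residual[best][:]
        # Remove what the chosen pair already solves from every row.
        for row in residual:
            for j in range(num_problems):
                row[j] = max(row[j] - covered[j], 0.0)
    return selected


matrix = [
    [1.0, 0.0, 0.0],  # pair 0 solves problem 0
    [0.9, 0.0, 0.0],  # pair 1 is redundant with pair 0
    [0.0, 1.0, 0.0],  # pair 2 solves problem 1
    [0.0, 0.0, 0.0],  # pair 3 solves nothing
]
print(select_aupairs(matrix, budget=3))
```

In this toy matrix the redundant pair 1 is skipped in favour of pair 2, which is the complementarity effect the selection is meant to capture.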

Abstract

Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response, or guess, the LLM corrects its own mistake and produces an improved response, or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is selected as the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows significantly stronger scaling with inference-time compute budget compared to baselines.
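The inference-time procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` stands in for an LLM call, `score` stands in for the unit-test pass rate, and the prompt layout in `build_prompt` is a hypothetical placeholder for the paper's actual repair prompt.

```python
from typing import Callable, Sequence, Tuple

# (problem description, guess, fix) for one in-context example
Pair = Tuple[str, str, str]


def build_prompt(aupair: Pair, problem: str, guess: str) -> str:
    """Hypothetical 1-shot layout: one worked repair, then the new problem."""
    ex_problem, ex_guess, ex_fix = aupair
    return (
        f"Problem:\n{ex_problem}\nGuess:\n{ex_guess}\nFix:\n{ex_fix}\n\n"
        f"Problem:\n{problem}\nGuess:\n{guess}\nFix:\n"
    )


def repair_with_aupairs(
    problem: str,
    guess: str,
    aupairs: Sequence[Pair],
    budget: int,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Use the first `budget` AuPairs, one LLM call each; return the best fix."""
    chosen = list(aupairs[:budget])
    if not chosen:
        raise ValueError("need at least one AuPair and a positive budget")
    best_fix, best_score = "", float("-inf")
    for aupair in chosen:
        candidate = generate(build_prompt(aupair, problem, guess))
        candidate_score = score(problem, candidate)
        if candidate_score > best_score:
            best_fix, best_score = candidate, candidate_score
    return best_fix, best_score


# Toy stand-ins for the LLM and the unit-test scorer (assumptions for the demo).
def fake_generate(prompt: str) -> str:
    return "fix-A" if "example-A" in prompt else "fix-B"


def fake_score(problem: str, code: str) -> float:
    return {"fix-A": 0.5, "fix-B": 1.0}.get(code, 0.0)


demo_aupairs = [("pA", "gA", "example-A"), ("pB", "gB", "example-B")]
print(repair_with_aupairs("p", "g", demo_aupairs, 2, fake_generate, fake_score))
```

Because the AuPairs are ordered, a smaller budget simply uses a prefix of the list, which is why the order produced by the selection step matters.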

Paper Structure

This paper contains 27 sections, 3 equations, 19 figures, 2 tables, 1 algorithm.

Figures (19)

  • Figure 1: An example AuPair: The LLM-generated guess and fix, along with their respective scores for the corresponding CodeForces problem (problem description at the top). The guess checks only the first digit for every single number leading up to the input. The fix corrects the logic by iterating over the divisors of the input, and checking for an intersection over all digits with the input. To provide this AuPair in context at inference time, the problem description, guess, and fix are concatenated as described in §\ref{fig:repair_prompt}.
  • Figure 2: Pair Generation: This phase includes collecting a large set $\mathcal{C}$ of guesses for coding problems and their fixes, yielding candidate pairs that will later be used to get AuPairs. At each step, a problem with its guess is sampled from the training dataset, and used in conjunction with $k$ randomly sampled pairs from the candidate pair buffer to compose a $k$-shot prompt. This prompt is then passed through an LLM to generate a fix, which is evaluated on the unit tests by running the Python interpreter and computing its test pass rate. If this fix is better than the guess, this (guess, fix) pair is added to the set of candidate pairs. Any improved but imperfect fix is also added as a new guess to the training dataset. See §\ref{sec:phase1} for more details.
  • Figure 3: AuPair Extraction: given a large set of candidate pairs $\mathcal{C}$, the next step is to extract AuPairs from them. For this, each pair is provided as a 1-shot in-context example in the prompt for each problem and its guess from the validation dataset. These prompts are then passed to the LLM, which generates fixes that are evaluated on the corresponding unit tests to populate a fix-quality matrix, as described in Algorithm \ref{alg:fix-quality}. Following this, a submodular selection mechanism is applied to this fix-quality matrix to obtain the list of AuPairs, as described in Algorithm \ref{alg:submodular}.
  • Figure 4: In-distribution code repair performance: with $N = 32$ LLM calls at inference time and the same train / val / test data distribution, we compute the test pass rate. The same model is used for generating the initial guesses and fixes and the AuPair extraction. CodeForces (left, 8.8k problems) and AtCoder (right, 1.3k problems), see §\ref{sec:in_dist_performance} for more details.
  • Figure 5: (a) AuPairs vs. random pairs: AuPairs (green) are significantly (about $2.5$ to $3\times$) more compute-efficient than random pairs (red); it takes only 12 AuPairs to reach the same performance as 32 random pairs; (b) Scaling inference-time compute: using AuPairs the score increases with compute budget at a much steeper rate compared to baselines (CodeForces dataset, Gemini-1.5-Pro).
  • ...and 14 more figures
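
The pair-generation loop described in the Figure 2 caption can be sketched roughly as below. This is an illustrative reading of the caption only: `generate_fix` stands in for the $k$-shot LLM call, `score` stands in for the unit-test pass rate (assumed to lie in $[0, 1]$), and the function name, the `steps` and `seed` parameters, and the lack of deduplication are all assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Guess = Tuple[str, str]      # (problem, guess)
Pair = Tuple[str, str, str]  # (problem, guess, fix)


def generate_pairs(
    train: Sequence[Guess],
    generate_fix: Callable[[List[Pair], str, str], str],
    score: Callable[[str, str], float],
    steps: int,
    k: int = 2,
    seed: int = 0,
) -> List[Pair]:
    """Collect candidate (guess, fix) pairs whose fix beats the guess."""
    rng = random.Random(seed)
    guesses: List[Guess] = list(train)
    candidates: List[Pair] = []
    for _ in range(steps):
        problem, guess = rng.choice(guesses)
        # Up to k previously collected pairs form the k-shot prompt.
        shots = rng.sample(candidates, min(k, len(candidates)))
        fix = generate_fix(shots, problem, guess)
        fix_score = score(problem, fix)
        if fix_score > score(problem, guess):
            candidates.append((problem, guess, fix))
            if fix_score < 1.0:
                # Improved but imperfect fixes become new guesses to repair.
                guesses.append((problem, fix))
    return candidates


# Toy stand-ins (assumptions for the demo): each call improves by one version.
VERSION_SCORES = {"v0": 0.0, "v1": 0.5, "v2": 1.0}


def fake_generate_fix(shots: List[Pair], problem: str, guess: str) -> str:
    return {"v0": "v1", "v1": "v2"}.get(guess, guess)


def fake_score(problem: str, code: str) -> float:
    return VERSION_SCORES[code]


pairs = generate_pairs([("p", "v0")], fake_generate_fix, fake_score, steps=20)
print(pairs[0])
```

Feeding imperfect fixes back in as guesses lets the pool contain repairs at several levels of difficulty, from badly broken guesses to nearly correct ones.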