ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li

Abstract

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a \emph{circular dependency}. Our key insight is that we need not determine test correctness at all: \emph{test votes should rank, not merely count}. What matters is not how many codes pass a test, but whether the test can \emph{distinguish} correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC~(LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose \textbf{ACES}~(\textbf{A}UC \textbf{C}onsist\textbf{E}ncy \textbf{S}coring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@$k$ on multiple code generation benchmarks.
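To make the leave-one-out construction concrete, the minimal NumPy sketch below computes $\mathrm{LOO\text{-}AUC}_j$ for every test column of a binary pass matrix and converts it into closed-form per-test weights of the form quoted in the Figure 4 caption below, $(\mathrm{LOO\text{-}AUC}_j - \tfrac{1}{2})\,p_j(1{-}p_j)$. All names here are ours; this is an illustration of the idea, not the authors' released implementation.

```python
import numpy as np

def loo_auc(B, w=None):
    """Leave-one-out AUC for each test (column) of a binary pass matrix.

    B : (n_codes, n_tests) 0/1 array, B[i, j] = 1 iff code i passes test j.
    w : optional per-test weights used to score codes (uniform by default).

    For test j, codes are scored by their weighted passes over all *other*
    tests; LOO-AUC_j measures how well test j's own pass/fail labels agree
    with that ranking (ties count as 1/2).
    """
    B = np.asarray(B, dtype=float)
    m = B.shape[1]
    w = np.ones(m) if w is None else np.asarray(w, dtype=float)
    aucs = np.full(m, 0.5)
    for j in range(m):
        s = B @ w - w[j] * B[:, j]               # scores with test j held out
        passed, failed = s[B[:, j] == 1], s[B[:, j] == 0]
        if len(passed) == 0 or len(failed) == 0:
            continue                             # trivial test: no ranking signal
        diff = passed[:, None] - failed[None, :]
        aucs[j] = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    return aucs

def aces_c_style_weights(B):
    """Closed-form weights in the spirit of ACES-C (illustrative sketch):
    (LOO-AUC_j - 1/2) * p_j * (1 - p_j), with p_j the pass rate of test j."""
    B = np.asarray(B, dtype=float)
    p = B.mean(axis=0)
    return (loo_auc(B) - 0.5) * p * (1.0 - p)

# Toy usage: 4 candidate codes x 3 generated tests; codes are ranked by B @ w.
B = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [0, 0, 1]])
w = aces_c_style_weights(B)
print(w)        # per-test weights
print(B @ w)    # per-code scores; the highest-scoring candidate is selected
```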

Paper Structure

This paper contains 48 sections, 9 theorems, 61 equations, 14 figures, 9 tables, and 2 algorithms.

Key Result

Theorem 2 (Pass@$k$ Bound)

Define the mean signal $M(w)$ and the signal-to-noise ratio $R(w)$ of the weights $w$. For any $w$ with $M(w) > 0$ and any $k \geq 1$, Pass@$k$ is bounded in terms of $M(w)$, $R(w)$, the number of correct codes $n^+$, and the number of incorrect codes $n^- = n - n^+$. Over non-negative weights, $R(w)$ is maximized by $w^*_j \propto \max(0,\;\delta_j)$. $\blacktriangleleft$
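For intuition only, the sketch below computes the oracle weights $w^*_j \propto \max(0,\delta_j)$ when ground-truth correctness labels are available. Reading $\delta_j$ as the gap between test $j$'s pass rates on correct and incorrect codes is our assumed paraphrase of Definition 1; ACES itself never has access to these labels.

```python
import numpy as np

def oracle_weights(B, is_correct):
    """Oracle weights w*_j = max(0, delta_j) from Theorem 2 (sketch only).

    delta_j is read here as test j's pass rate on correct codes minus its
    pass rate on incorrect codes (an assumed reading of Definition 1);
    computing it needs ground-truth labels, which ACES does not use.
    Assumes at least one correct and one incorrect candidate.
    """
    B = np.asarray(B, dtype=float)
    y = np.asarray(is_correct, dtype=bool)
    delta = B[y].mean(axis=0) - B[~y].mean(axis=0)
    return np.maximum(0.0, delta)
```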

Figures (14)

  • Figure 1: Vote types for a test $t_j$ on a correct-incorrect pair $(c^+, c^-)$. Each test casts a vote $h_j = B_{c^+,j} - B_{c^-,j}$ on their ordering; the score difference $s_{c^+}\!-\!s_{c^-} = \sum_j w_j h_j$ is the weighted aggregate (a numerical check of this decomposition appears in the sketch after this list).
  • Figure 2: ACES on two constructed $8 \times 10$ instances (full data in Appendix \ref{app:example}). Test colors: green = perfect, blue = constructive/permissive, red = misleading. (a) Pass matrix. (b),(c) $\mathrm{LOO\text{-}AUC}_j$ (upper) and weights $w_j$ (lower, purple). (d) Scores; gold/silver/bronze mark the top-3 codes.
  • Figure 3: Empirical analysis on MBPP at Pass@1. (a) Tasks binned by $\bar{\delta}$; bars show pass/fail counts, lines show pass rate; bottom panel compares pass rates by assumption status. (b) Performance impact of each $\delta_j$ bin upon removing its tests; full results in Appendix \ref{app:assumption}.
  • Figure 4: Test quality detection. ACES-C weight $(\mathrm{LOO\text{-}AUC}_j - \tfrac{1}{2})\,p_j(1{-}p_j)$ vs. ground-truth discriminative power $\delta_j$ for all non-trivial tests across three benchmarks. Green and red points denote informative ($\delta_j > 0$) and misleading ($\delta_j < 0$) tests, respectively; top marginals show the per-bin classification breakdown, with errors (FP, FN) in saturated colors and correct classifications (TP, TN) in lighter shades. Quadrant percentages report the fraction of tests classified as TP/FP/FN/TN by the sign of the ACES-C weight (summing to 100%).
  • Figure 5: Pass@$k$ vs. $k$ ($k = 1, \ldots, 20$) on all three benchmarks, combined with $\mathcal{DS}^3$ pre-filtering. ACES-C + $\mathcal{DS}^3$ leads at small $k$ on HumanEval/HumanEval$^+$; the two variants converge at larger $k$. On MBPP they perform comparably throughout.
  • ...and 9 more figures
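As a quick numerical check of the score-difference decomposition in the Figure 1 caption, on made-up numbers:

```python
import numpy as np

# For a pair (c+, c-), the score difference s_{c+} - s_{c-} equals the
# weighted sum of per-test votes h_j = B[c+, j] - B[c-, j].
B = np.array([[1, 1, 0, 1],          # row 0: candidate c+
              [0, 1, 1, 0]])         # row 1: candidate c-
w = np.array([0.4, 0.1, 0.3, 0.2])   # arbitrary non-negative test weights
h = B[0] - B[1]                      # per-test votes in {-1, 0, +1}
s = B @ w                            # weighted scores of the two candidates
assert np.isclose(s[0] - s[1], w @ h)
print(h, s[0] - s[1])
```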

Theorems & Definitions (15)

  • Definition 1: Discriminative Power
  • Theorem 2: Pass@k Bound
  • Theorem 3: LOO-AUC Identity
  • Proposition 5: Structure of $c_j(w_{\mathrm{unif}})$
  • Theorem 6: ACES-C Recovers Discriminative Power
  • Proof
  • Lemma 7: Hoeffding's inequality (Hoeffding, 1963)
  • Theorem: \ref{thm:hoeffding}, restated
  • Proof
  • Theorem: \ref{lem:loo-identity}, restated
  • ...and 5 more