ROC-n-reroll: How verifier imperfection affects test-time scaling

Florian E. Dorner, Yatong Chen, André F. Cruz, Fanny Yang

TL;DR

This work provides a theoretical and empirical study of test-time scaling with imperfect verifiers, focusing on Rejection Sampling (RS) and Best-of-N (BoN). It shows that for a fixed query, the per-instance accuracy of RS and BoN is fully determined by the base generator accuracy $\pi$ and the verifier ROC curve $T(F)$. RS outperforms BoN at the same compute, and both methods converge to the same accuracy in the infinite-compute limit. Crucially, the authors prove that with imperfect verifiers, high-compute performance cannot be reliably extrapolated from low-compute observations: small shifts in the ROC near the origin can drastically alter high-budget outcomes, and early scaling offers no guaranteed signal of ultimate performance. Experiments using Qwen and Llama verifiers on GSM8K and MATH500 validate the predictions, showing the practical advantages of RS and highlighting the limits of extrapolation across compute regimes. These insights inform the design of verifier-based test-time strategies and motivate hybrid or budget-aware approaches that adapt to ROC geometry.

Abstract

Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance, a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier's ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and Llama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
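To make the two methods concrete, here is a minimal Python sketch, not the authors' code. It assumes a toy verifier whose scores are Gaussian with unit variance, with mean 1 for correct answers and mean 0 for incorrect ones; that score model and all function names are illustrative assumptions. RS is modelled as resampling until the score reaches a threshold `tau`, and BoN as keeping the highest-scoring of `n` samples. The closed-form RS accuracy is Bayes' rule applied at a single ROC point, and the expected sample count is the mean of a geometric distribution.

```python
import math
import random

def gaussian_roc_point(tau):
    """(TPR, FPR) at threshold tau for the assumed toy verifier:
    scores ~ N(1, 1) for correct answers and N(0, 1) for incorrect ones."""
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # normal CDF
    return 1.0 - phi(tau - 1.0), 1.0 - phi(tau)

def rs_accuracy(pi, tpr, fpr):
    """P(correct | accepted) for rejection sampling at one ROC point."""
    accept = pi * tpr + (1.0 - pi) * fpr  # P(a single sample is accepted)
    return pi * tpr / accept

def rs_expected_samples(pi, tpr, fpr):
    """Mean number of generated samples until the first acceptance."""
    return 1.0 / (pi * tpr + (1.0 - pi) * fpr)

def draw_sample(pi, rng):
    """Return (is_correct, score) for one generated answer."""
    correct = rng.random() < pi
    return correct, rng.gauss(1.0 if correct else 0.0, 1.0)

def simulate_rs(pi, tau, trials, rng):
    """Monte Carlo accuracy and mean sample count of RS with threshold tau."""
    hits, used = 0, 0
    for _ in range(trials):
        while True:
            used += 1
            correct, score = draw_sample(pi, rng)
            if score >= tau:  # accept the first sample that clears the bar
                hits += correct
                break
    return hits / trials, used / trials

def simulate_bon(pi, n, trials, rng):
    """Monte Carlo accuracy of Best-of-N: keep the top-scoring of n samples."""
    hits = 0
    for _ in range(trials):
        best = max((draw_sample(pi, rng) for _ in range(n)), key=lambda s: s[1])
        hits += best[0]
    return hits / trials
```

Calling `simulate_rs(0.3, 1.0, 20000, random.Random(0))` and comparing the result with `rs_accuracy` and `rs_expected_samples` evaluated at `gaussian_roc_point(1.0)` is one way to check the simulation against the closed form. Sweeping `tau` for RS and `n` for BoN traces out accuracy-compute curves for this toy verifier only.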

Paper Structure

This paper contains 47 sections, 19 theorems, 143 equations, 9 figures, 1 table.

Key Result

Proposition 1

Let $f$ be a score and $\text{T}: [0,1] \to [0,1]$ be the ROC curve of $f$. If the derivative $\text{T}'(\text{F})$ exists at $\text{F}$, the derivative of the accuracy-compute curve at $C(\text{F})$ is given by [equation missing in source]. For (strictly) concave ROC curves, $\frac{dA(C)}{dC}\big|_{C=C(\text{F})}$ is (strictly) positive whenever $\text{T}'(\text{F})$ exists.
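The closed form is not reproduced in this listing. As a hedged sketch of how such an expression can arise, assume RS accepts a sample when its score clears a threshold corresponding to the ROC point $(\text{F}, \text{T}(\text{F}))$, that compute $C$ is the expected number of generated samples, and that $\pi$ is the generator accuracy. These modelling choices are assumptions made here, and the result may differ in form from the paper's statement.

```latex
% Assumed RS model: accept a sample iff its score clears the threshold at (F, T(F)).
\begin{align*}
D(\text{F}) &= \pi\,\text{T}(\text{F}) + (1-\pi)\,\text{F}
  && \text{(probability that one sample is accepted)} \\
A(\text{F}) &= \frac{\pi\,\text{T}(\text{F})}{D(\text{F})}, \qquad
C(\text{F}) = \frac{1}{D(\text{F})}
  && \text{(Bayes' rule; mean of a geometric distribution)} \\
\frac{dA}{d\text{F}} &= \frac{\pi(1-\pi)\,\bigl(\text{F}\,\text{T}'(\text{F}) - \text{T}(\text{F})\bigr)}{D(\text{F})^{2}}, \qquad
\frac{dC}{d\text{F}} = -\,\frac{\pi\,\text{T}'(\text{F}) + (1-\pi)}{D(\text{F})^{2}} \\
\frac{dA(C)}{dC}\bigg|_{C=C(\text{F})}
  &= \frac{dA/d\text{F}}{dC/d\text{F}}
   = \frac{\pi(1-\pi)\,\bigl(\text{T}(\text{F}) - \text{F}\,\text{T}'(\text{F})\bigr)}{\pi\,\text{T}'(\text{F}) + (1-\pi)}
\end{align*}
```

Under these assumptions the sign claim follows from concavity: a concave ROC curve with $\text{T}(0)=0$ satisfies $\text{T}(\text{F}) - \text{F}\,\text{T}'(\text{F}) \ge \text{T}(0) = 0$, with strict inequality for strictly concave curves and $\text{F} > 0$, while the denominator is positive because $\text{T}'(\text{F}) \ge 0$ and $\pi < 1$.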

Figures (9)

  • Figure 1: Empirical performance (markers) of RS (middle) and BoN (right) on GSM8K test question $58$, overlaid with theoretical predictions (lines). Different verifiers scale similarly at first, but then diverge. RS matches BoN accuracy, using less average compute. Generator: Qwen3-1.7B.
  • Figure 2: Performance of RS (line) and BoN (scatter) with different verifiers (synthetic data).
  • Figure 3: Empirical performance ($\mathsf{x}$ markers) of RS (purple) and BoN (olive) on GSM8K test question $2$, overlaid with theoretical predictions (lines). Dotted: Llama-3.2-3B as verifier (single CoT). Solid: Llama-4-17B-16E as verifier (single CoT). Controlling for the number of generated samples, RS consistently outperforms BoN for both verifiers. Generator: Llama-3.2-3B.
  • Figure 4: Aggregate accuracy of BoN and RS on MATH500 (left plot) and GSM8K (right plot). The rightmost RS points for each verifier represent the maximal threshold $\tau=1$. Dotted lines show the maximal RS performance for the respective verifiers. In both cases, BoN initially underperforms RS, but matches RS performance at higher compute levels. Verifier models: Qwen3-32B (blue), Qwen3-4B (orange). Generator: Qwen3-1.7B. Error bars: Exact $90\%$ CIs for accuracy.
  • Figure F.1: Empirical performance (lines) of rejection sampling (middle) and BoN (right) on a GSM8K test question ($i=2$), overlaid with predicted theoretical performance ($\mathsf{x}$ markers). Verification score obtained from a single chain of thought. Generator: Qwen3-1.7B.
  • ...and 4 more figures

Theorems & Definitions (46)

  • Definition 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • Theorem 1
  • Proposition 7
  • Proposition 8
  • ...and 36 more