Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi

TL;DR

The paper investigates how scaling test-time compute improves sampling-based search with self-verification as a way to enhance reasoning. It demonstrates that simple, parallelized sampling and self-verification exhibit sustained power-law performance improvements on challenging benchmarks, driven in part by implicit scaling. Two actionable verification principles (comparing across candidate responses and rewriting outputs for structured verification) substantially boost accuracy. Frontier models nonetheless exhibit weak out-of-the-box verification, which motivates a new benchmark. Overall, the work positions sampling-based search as a strong, scalable baseline for inference-time computation and offers practical guidance for leveraging verification at scale.

Abstract

Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities, and we introduce a benchmark to measure progress on these deficiencies.
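
The minimalist procedure described in the abstract (sample several candidates, self-verify each one, keep the best) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the names `sampling_based_search`, `generate`, `verify`, `k_inference` and `make_cycling_generator` are hypothetical, the toy generator and verifier stand in for real model calls, and scoring a candidate by the fraction of verification attempts it passes is an assumption made here for simplicity.

```python
import itertools
from typing import Callable, List, Sequence


def sampling_based_search(
    question: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    k_inference: int = 8,
    k_verif: int = 4,
) -> str:
    """Sample k_inference candidates, score each by the fraction of k_verif
    verification attempts it passes, and return the top-scoring candidate.

    Ties are resolved in favor of the earliest candidate (an assumption;
    no dedicated tie-breaking step is modeled here).
    """
    candidates: List[str] = [generate(question) for _ in range(k_inference)]
    scores: List[float] = [
        sum(verify(question, c) for _ in range(k_verif)) / k_verif
        for c in candidates
    ]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


def make_cycling_generator(answers: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled model: cycles through fixed answers."""
    it = itertools.cycle(answers)
    return lambda _question: next(it)


# Toy usage: the "model" proposes 41, 42, 40 in turn, and the toy verifier
# accepts only "42". Both are placeholders for real model calls.
toy_generate = make_cycling_generator(["41", "42", "40"])
toy_verify = lambda _question, response: response == "42"
print(sampling_based_search("toy question", toy_generate, toy_verify,
                            k_inference=6, k_verif=3))  # prints 42
```

In a real setting both callables would be stochastic model calls, and the candidates and verification attempts could be issued in parallel, since none of them depends on another.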

Paper Structure

This paper contains 48 sections, 19 theorems, 3 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

The number of rectangles that can be formed inside a fixed regular dodecagon, where each side of the rectangle lies on either a side or a diagonal of the dodecagon, is $\boxed{198}$.

Figures (5)

  • Figure 2.1: Heatmap of Gemini v1.5 Pro accuracy rates using sampling-based search (without tie-breaking) as the number of responses generated (x-axis) and verification attempts (y-axis) increase. Warmer colors indicate higher accuracy (cubic scale). The largest gains occur when scaling both search and verification, with the strongest trend on AIME.
  • Figure 2.2: Plot of Gemini v1.5 Pro accuracy rates using sampling-based search (without tie-breaking and with $k_{\mathrm{verif}} = 50$) on ambiguous questions only as the number of responses generated increases. A question is ambiguous when the model generates at least one candidate response with a correct final answer. Accuracy on ambiguous questions increases with search.
  • Figure 2.3: Heatmap of Gemini v1.5 Pro accuracy rates using sampling-based search (without tie-breaking) on ambiguous questions only as the number of responses generated (x-axis) and verification attempts (y-axis) increase. Warmer colors indicate higher accuracy (linear scale). A question is ambiguous when the model generates at least one candidate response with a correct final answer. Accuracy on ambiguous questions increases with search (x-axis).
  • Figure 2.4: Line graph depicting the accuracy rates of the Gemini v1.5 Pro model using sampling-based search as the number of candidate responses generated is scaled upwards. The number of verification attempts is fixed at 50 for all plots. The depicted accuracies are obtained without tie-breaking and may be lower than reported elsewhere. Verification@k improves with $k$ even when Consistency@k stagnates on AIME and LiveBench Reasoning.
  • Figure 6.1: Example of an entry in our verification benchmark. The question is sourced from the LiveBench Reasoning benchmark, and the two responses are generated by Gemini v1.5 Pro. The green response has the correct final answer; the red response has the wrong final answer due to hallucinating a non-existent clause.
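
The captions above contrast Consistency@k with Verification@k and restrict some plots to ambiguous questions. A small sketch of how these quantities might be computed is given below. It assumes that Consistency@k means a majority vote over final answers and that Verification@k means picking the answer with the highest verification score; the function names and the toy answers and scores are made up for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def consistency_at_k(answers: Sequence[str]) -> str:
    """Majority vote: the most common final answer among the k samples."""
    return Counter(answers).most_common(1)[0][0]


def verification_at_k(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Final answer of the response with the highest verification score.

    Each score is assumed to be an aggregate of that response's verification
    attempts, e.g. the fraction of k_verif attempts that judged it correct.
    """
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]


def is_ambiguous(answers: Sequence[str], truth: str) -> bool:
    """As in the captions: at least one candidate has the correct final answer."""
    return truth in answers


# Toy data (invented): the correct answer "12" appears once and is outvoted,
# but it receives the highest verification score.
answers = ["7", "7", "7", "12", "7"]
scores = [0.2, 0.1, 0.3, 0.9, 0.2]
truth = "12"

print(consistency_at_k(answers))           # prints 7
print(verification_at_k(answers, scores))  # prints 12
print(is_ambiguous(answers, truth))        # prints True
```

The toy case shows why the two metrics can diverge: majority voting cannot recover a correct answer that appears only in a minority of samples, whereas selection by verification score can, provided the verifier ranks it highest.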

Theorems & Definitions (38)

  • Theorem 1: Main Claim
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 28 more