Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Eric Zhao; Pranjal Awasthi; Sreenivas Gollapudi

TL;DR

The paper studies how sampling-based search with self-verification scales with test-time compute. It shows that simple, parallelized sampling and self-verification yield sustained power-law performance improvements on challenging benchmarks, driven in part by implicit scaling. Two actionable verification principles (comparing across candidate responses, and rewriting outputs into a form suited to structured verification) substantially boost accuracy. Frontier models nonetheless exhibit weak out-of-the-box verification, which motivates a new benchmark. Overall, the work positions sampling-based search as a strong, scalable baseline for inference-time computation and offers practical guidance for leveraging verification at scale.

Abstract

Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities, and we introduce a benchmark to measure progress on these deficiencies.
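
The sample-then-verify loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample` and `verify` are hypothetical callables standing in for model calls, the score is simply the fraction of verification attempts that accept a response, and ties are broken by taking the first top-scoring candidate rather than by comparing responses against each other.

```python
import random
from typing import Callable, List


def sampling_based_search(
    sample: Callable[[], str],
    verify: Callable[[str], bool],
    num_samples: int,
    num_verifications: int,
) -> str:
    """Sample candidates, score each by repeated verification, return the best.

    `sample` and `verify` are hypothetical stand-ins for model calls.
    """
    if num_samples < 1 or num_verifications < 1:
        raise ValueError("num_samples and num_verifications must be >= 1")

    # Step 1: draw candidate responses independently (parallelizable in practice).
    candidates: List[str] = [sample() for _ in range(num_samples)]

    # Step 2: score each candidate by the fraction of verification passes.
    scores: List[float] = []
    for candidate in candidates:
        passes = sum(verify(candidate) for _ in range(num_verifications))
        scores.append(passes / num_verifications)

    # Step 3: select the highest-scoring candidate (first one wins on ties;
    # this tie-breaking rule is a simplification made for the sketch).
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_sample() -> str:
        # Toy sampler: returns the "correct" answer "42" only some of the time.
        return "42" if rng.random() < 0.3 else str(rng.randint(0, 9))

    def toy_verify(response: str) -> bool:
        # Toy noisy verifier: accepts correct answers more often than wrong ones.
        accept_prob = 0.8 if response == "42" else 0.3
        return rng.random() < accept_prob

    print(sampling_based_search(toy_sample, toy_verify,
                                num_samples=20, num_verifications=10))
```

The two knobs, `num_samples` and `num_verifications`, correspond to the two axes along which test-time compute can be scaled in this paradigm: how many responses are drawn, and how much effort goes into checking each one.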
