Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi
TL;DR
The paper studies how sampling-based search with self-verification scales with test-time compute to improve reasoning. It shows that simple, parallelized sampling and self-verification yield sustained power-law performance improvements on challenging benchmarks, driven in part by implicit scaling: a larger pool of sampled responses also improves verification accuracy. Two actionable verification principles (comparing across candidate responses, and rewriting outputs into a form suited to structured verification) substantially boost accuracy. Frontier models nonetheless exhibit weak out-of-the-box verification, which motivates a new benchmark. Overall, the work positions sampling-based search as a strong, scalable baseline for inference-time computation and offers practical guidance for leveraging verification at scale.
Abstract
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above those of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-the-box verification capabilities, and we introduce a benchmark to measure progress on these deficiencies.
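The sample-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sampling_based_search`, the `generate` and `verify` callables, the parameters `num_samples` and `num_verifications`, and the first-occurrence tie-breaking rule are all assumptions made for illustration. In practice `generate` and `verify` would wrap calls to a language model; here they are toy stand-ins so the example runs on its own.

```python
from itertools import cycle
from typing import Callable, List


def sampling_based_search(
    question: str,
    generate: Callable[[str], str],      # hypothetical: returns one candidate response
    verify: Callable[[str, str], bool],  # hypothetical: one self-verification verdict
    num_samples: int = 8,
    num_verifications: int = 4,
) -> str:
    """Sample candidates, score each by repeated verification, return the top one."""
    # Step 1: draw several candidate responses (independent, so parallelizable).
    candidates: List[str] = [generate(question) for _ in range(num_samples)]

    # Step 2: score each candidate by the fraction of verification calls it passes.
    scores: List[float] = []
    for candidate in candidates:
        passes = sum(verify(question, candidate) for _ in range(num_verifications))
        scores.append(passes / num_verifications)

    # Step 3: select the highest-scoring candidate.
    # Assumption of this sketch: ties go to the earliest candidate.
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


# Toy stand-ins for a model: the generator cycles through fixed answers,
# and the verifier accepts only the answer "42".
_answers = cycle(["41", "42", "43"])


def toy_generate(question: str) -> str:
    return next(_answers)


def toy_verify(question: str, candidate: str) -> bool:
    return candidate == "42"


answer = sampling_based_search(
    "What is 6 * 7?", toy_generate, toy_verify, num_samples=3, num_verifications=2
)
print(answer)
```

Both knobs in this sketch consume test-time compute: `num_samples` widens the candidate pool, while `num_verifications` spends more effort checking each candidate.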
