When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

Jack Lu; Ryan Teehan; Jinran Jin; Mengye Ren

TL;DR

The paper presents a large-scale, cross-family study of LLM-based verification and introduces verifier gain, a metric that quantifies the test-time improvement from verifier-based rejection sampling. Cross-family verification often yields the largest gains, while stronger solvers and post-training make self-verification less useful and shift gains toward cross-family settings. Verifiers tend to favor solutions that resemble their own outputs, and this bias weakens as the solver and verifier output distributions become less similar. Logical and mathematical tasks and synthetic puzzles are inherently easier to verify. These results offer practical guidance for deploying verifiers and motivate further work on task verifiability and the origins of verifier bias.

Abstract

Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
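
To make the mechanism concrete, the following is a minimal sketch of verifier-based rejection sampling, together with a simple estimate of how much a verifier could help. It is not the paper's implementation. The names `rejection_sample`, `accepted_precision`, and `illustrative_gain` are hypothetical, and the gain formula (precision among accepted solutions minus solver accuracy) is an assumption made for illustration, since the abstract does not give the formal definition of verifier gain.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def rejection_sample(
    solve: Callable[[], T],
    verify: Callable[[T], bool],
    max_attempts: int = 8,
) -> T:
    """Draw candidates from the solver until the verifier accepts one.

    Falls back to the last candidate if none is accepted within
    max_attempts (an assumed fallback policy, chosen for simplicity).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    candidate = solve()
    for _ in range(max_attempts - 1):
        if verify(candidate):
            return candidate
        candidate = solve()
    return candidate


def accepted_precision(solver_acc: float, tpr: float, fpr: float) -> float:
    """Probability that an accepted solution is correct (Bayes' rule).

    solver_acc: fraction of solver outputs that are correct
    tpr: fraction of correct solutions the verifier accepts
    fpr: fraction of incorrect solutions the verifier accepts
    """
    accepted = tpr * solver_acc + fpr * (1.0 - solver_acc)
    if accepted == 0.0:
        # The verifier accepts nothing, so it cannot change the outcome.
        return solver_acc
    return tpr * solver_acc / accepted


def illustrative_gain(solver_acc: float, tpr: float, fpr: float) -> float:
    """Assumed proxy for verifier gain: precision minus solver accuracy."""
    return accepted_precision(solver_acc, tpr, fpr) - solver_acc
```

Under these assumptions, a verifier whose acceptance rate is the same for correct and incorrect solutions (`tpr == fpr`) yields zero gain, which illustrates why a high false positive rate can erase the benefit of verification.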

