When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren

TL;DR

The paper presents a large-scale, cross-family study of LLM-based verification and introduces verifier gain, a metric that quantifies the test-time improvement from verifier-based rejection sampling. Cross-family verification often yields the largest gains, while stronger solvers and post-training make self-verification less useful and shift gains toward cross-family settings. Verifiers are biased toward solutions that resemble their own outputs, and this bias weakens as the similarity between solver and verifier output distributions declines. Logical and mathematical tasks and synthetic puzzles are identified as inherently easier to verify. These results offer practical guidance for choosing solver-verifier pairings and motivate further investigation into task verifiability and the origins of verifier bias.

Abstract

Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
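To make the mechanism in the abstract concrete, the following is a minimal simulation sketch of verifier-based rejection sampling. It is not the paper's implementation: the function names, the modelling of a verifier by a fixed true positive rate (`tpr`) and false positive rate (`fpr`), and the fallback to the last attempt when nothing is accepted are all assumptions made for illustration. The `precision` helper computes the accuracy among accepted solutions under this toy model; comparing it against the solver's accuracy gives one plausible reading of a gain from verification, which may differ from the paper's exact definition of verifier gain.

```python
import random


def rejection_sample(solve, verify, max_attempts):
    """Return the first candidate the verifier accepts.

    Assumption: if no candidate is accepted within max_attempts,
    fall back to the last candidate drawn.
    """
    candidate = None
    for _ in range(max_attempts):
        candidate = solve()
        if verify(candidate):
            return candidate
    return candidate


def simulate_accuracy(solver_acc, tpr, fpr, max_attempts, trials=20000, seed=0):
    """Estimate final accuracy under a toy solver/verifier model.

    A candidate is represented only by whether it is correct (True/False).
    The hypothetical verifier accepts correct candidates with probability
    tpr and incorrect ones with probability fpr.
    """
    rng = random.Random(seed)

    def solve():
        return rng.random() < solver_acc

    def verify(is_correct):
        return rng.random() < (tpr if is_correct else fpr)

    hits = sum(rejection_sample(solve, verify, max_attempts) for _ in range(trials))
    return hits / trials


def precision(solver_acc, tpr, fpr):
    """Accuracy among accepted candidates under the same toy model."""
    accepted = solver_acc * tpr + (1 - solver_acc) * fpr
    return solver_acc * tpr / accepted if accepted > 0 else solver_acc


if __name__ == "__main__":
    acc, tpr, fpr = 0.5, 0.8, 0.2
    for k in (1, 5, 9):
        print(k, simulate_accuracy(acc, tpr, fpr, max_attempts=k))
    print("precision minus solver accuracy:", precision(acc, tpr, fpr) - acc)
```

In this toy model, a single attempt leaves accuracy at the solver's baseline, and as the number of attempts grows the accuracy approaches the precision of the verifier on that solver's outputs. A verifier with a high false positive rate therefore caps the achievable improvement regardless of how many attempts are allowed.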

Paper Structure

This paper contains 35 sections, 4 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Average solver accuracy of each model over all datasets. Base model families are suffixed with -Base. Models within each family are ordered by increasing size.
  • Figure 2: Correlation between each verifier's metrics (rows) and its own solver accuracy for all 21 post-trained models, averaged over all datasets. Each verifier metric is computed over our three verification settings (columns).
  • Figure 3: Correlation between each verifier's metrics (rows) and model size for all 21 post-trained models, averaged over all datasets. In each plot, models are separated by family and ordered by increasing size. Each verifier metric is computed over our three verification settings (columns).
  • Figure 4: Comparison between theoretical and empirical verifier gains (rows) for each verification setting (columns). Row 1 shows verifier gains computed from the verifier gain equation. Rows 2 and 3 show the empirical gains from verifier-based rejection sampling with up to 5 and 9 solver attempts, respectively.
  • Figure 5: Correlation between verifier metrics and similarity scores between solver-verifier pairs. Each marker is colored by the verifier's model family.
  • ...and 6 more figures