Shrinking the Generation-Verification Gap with Weak Verifiers

Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré

TL;DR

Weaver addresses the generation–verification gap by aggregating many weak verifiers through a weak supervision framework to produce calibrated verification signals without extensive labeled data. It demonstrates that weighted verifier ensembles significantly outperform naive averaging and major baselines, and that distilling Weaver into a compact cross-encoder can preserve most gains with substantial compute savings. The approach scales across model sizes, verifier counts, and test-time generations, approaching frontier model performance on reasoning and math tasks. This work enables scalable, data-efficient, and compute-efficient verification to improve data filtering, model alignment, and inference-time decision-making.

Abstract

Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.
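
The difference between unweighted and accuracy-weighted verifier ensembles can be illustrated with a minimal sketch. This is not Weaver's implementation: the function names, the min-max normalization, the 0.5 vote threshold, the log-odds weighting, and all scores and accuracies below are assumptions made for illustration. In particular, the verifier accuracies are simply given here, whereas Weaver estimates them with weak supervision rather than reading them from labels.

```python
import math

def normalize(scores):
    """Min-max scale one verifier's raw scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def naive_select(score_matrix):
    """Return the index of the candidate with the highest unweighted mean score."""
    n = len(score_matrix[0])
    means = [sum(row[j] for row in score_matrix) / len(score_matrix) for j in range(n)]
    return max(range(n), key=means.__getitem__)

def weighted_select(score_matrix, accuracies, threshold=0.5):
    """Return the index of the candidate with the highest accuracy-weighted vote.

    Each verifier casts a +1/-1 vote per candidate, weighted by the log-odds
    of its accuracy (hypothetical values, assumed known for this sketch).
    """
    n = len(score_matrix[0])
    totals = [0.0] * n
    for row, acc in zip(score_matrix, accuracies):
        weight = math.log(acc / (1 - acc))
        for j, s in enumerate(row):
            totals[j] += weight if s >= threshold else -weight
    return max(range(n), key=totals.__getitem__)

# Three hypothetical verifiers with different output scales scoring three candidates.
raw_scores = [
    [2, 9, 3],          # verifier A, 0-10 scale
    [0.9, 0.2, 0.8],    # verifier B, 0-1 scale
    [80, 30, 90],       # verifier C, 0-100 scale
]
accuracies = [0.9, 0.55, 0.55]  # assumed, not estimated

normalized = [normalize(row) for row in raw_scores]
naive_choice = naive_select(normalized)
weighted_choice = weighted_select(normalized, accuracies)
print(naive_choice, weighted_choice)
```

In this toy setup the two near-chance verifiers outvote the accurate one under naive averaging, while the weighted vote follows the accurate verifier, which is the kind of effect the abstract attributes to differences in verifier accuracies.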

Paper Structure

This paper contains 53 sections, 34 equations, 27 figures, 22 tables.

Figures (27)

  • Figure 1: Weaver Framework: We propose Weaver, a framework combining multiple weak verifiers to effectively scale repeated sampling without parameter finetuning on ground truth labels (left). Weaver significantly outperforms majority voting and shrinks a model's generation-verification gap by 14.5%, on average, for GPQA Diamond and other datasets (Table \ref{tab:verifier_ablations}) (middle). By distilling Weaver from an ensemble of 70B verifiers to a single 400M cross-encoder, we can preserve 98.2% of the accuracy gains of Weaver while reducing inference compute cost by 99.97% (right).
  • Figure 2: Weighted Verifier Ensembles Outperform Naive Verifier Ensembles: By using oracle data to keep the best verifiers (i.e. top-$K$ verifier ensembles) or learn aggregation weights for verifiers (i.e. supervised weighted ensembles), we can improve beyond naive combinations of the verifiers available by 3.6% and 7.8%, on average, respectively.
  • Figure 3: Scaling Generations Boosts Performance with Weaver: The generation-verification gap shrinks when increasing $K$ and leveraging Weaver, outperforming alternative verification methods by an average 18.3%.
  • Figure 4: Weaver Outperforms Naive Ensemble across Oracle Top-5 Verifiers and Total Verifiers Configurations: Results are shown for Weaver ensembles and naive ensembles of the Oracle Top-5 Verifiers (highest-performing verifiers on dataset selected using ground truth) and Total Verifiers (all available verifiers). Weaver consistently outperforms naive ensemble averaging, with improvements ranging from +2.4% to +10.1%.
  • Figure 5: Weaver Improves the Accuracy-Compute Performance Trade-Offs. Success rate ($\%$) as a function of total inference compute per query (generation and verification compute, log scaled) for different verification strategies. Each point represents a different number of candidate generations (from $2^0$ to $2^{7}$). Weaver achieves the highest accuracy; it requires more compute than Majority Voting but continues to benefit from scaling. Weaver Distilled maintains most of Weaver's performance gains with 97.3% compute savings and substantial accuracy improvements over baseline methods.
  • ...and 22 more figures
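
The quantities these captions refer to can be made concrete with a small sketch. It assumes, as one reading of the abstract, that the generation-verification gap is the difference between the success rate of an oracle verifier (which picks a correct candidate whenever one exists) and the success rate of the verifier actually used. The function names and the toy correctness data are hypothetical and do not come from the paper.

```python
def pass_at_1(correct):
    """Success rate when always selecting the first candidate."""
    return sum(row[0] for row in correct) / len(correct)

def oracle_success(correct):
    """Success rate of a perfect verifier: succeeds if any candidate is correct."""
    return sum(any(row) for row in correct) / len(correct)

def selection_success(correct, chosen):
    """Success rate when a verifier picks candidate chosen[i] for query i."""
    return sum(row[c] for row, c in zip(correct, chosen)) / len(correct)

# Hypothetical correctness labels: 4 queries, 3 sampled candidates each.
correct = [
    [False, True, False],
    [True, True, False],
    [False, False, False],
    [False, False, True],
]
chosen = [1, 0, 2, 0]  # hypothetical picks made by some verifier

first = pass_at_1(correct)
oracle = oracle_success(correct)
selected = selection_success(correct, chosen)
gap = oracle - selected  # assumed definition of the generation-verification gap
print(first, oracle, selected, gap)
```

Under this reading, a better verifier raises the selection success rate toward the oracle rate, shrinking the gap, while sampling more candidates raises the oracle rate itself.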