Table of Contents
Fetching ...

Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, Igor Gitman

TL;DR

This work addresses the challenge of validating mathematical proofs produced by large language models beyond mere final answers. It contrasts two verification paradigms—single-proof judgments by LLMs and tournament-based proof selection (GenSelect)—and introduces a scalable hybrid framework that combines both for improved accuracy and compute efficiency. Through a rigorous evaluation framework across diverse datasets, the authors show that prompt design heavily influences performance, but reinforcement learning can reduce this sensitivity, particularly for proof judgments. The study provides practical guidelines for designing scalable proof-verification and selection systems, revealing that while proof-level metrics can improve, final-answer precision does not necessarily follow, underscoring the need for careful evaluation and potential human oversight in high-stakes mathematics tasks.

Abstract

Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.

Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection

TL;DR

This work addresses the challenge of validating mathematical proofs produced by large language models beyond mere final answers. It contrasts two verification paradigms—single-proof judgments by LLMs and tournament-based proof selection (GenSelect)—and introduces a scalable hybrid framework that combines both for improved accuracy and compute efficiency. Through a rigorous evaluation framework across diverse datasets, the authors show that prompt design heavily influences performance, but reinforcement learning can reduce this sensitivity, particularly for proof judgments. The study provides practical guidelines for designing scalable proof-verification and selection systems, revealing that while proof-level metrics can improve, final-answer precision does not necessarily follow, underscoring the need for careful evaluation and potential human oversight in high-stakes mathematics tasks.

Abstract

Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.

Paper Structure

This paper contains 15 sections, 52 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Left: Training curves for RL training of Qwen3-30B-A3B-Thinking-2507 on the OPC dataset. (1) Different prompts converge to similar performance after RL training. (2) Including the ground-truth proof or rubric in the prompt improves validation performance by 2–3%. (3) Training with both rubric and ground-truth proof achieves comparable performance. Right: Majority@5 results for the VerProofArena and Challenge-19 evaluation using Qwen3-30B-A3. See Tables \ref{['tab:maj5_results']} and \ref{['tab:maj5_results_proofbench']} in the Appendix for full results, additional models, datasets, and prompt ablations. $a \to b$ denotes trained on $a$ evaluated on $b$. $^*$For final-answer problems, precision and recall are computed based solely on the correctness of the final answer.
  • Figure 2: Best-of-n proof selection results on SelProofBench and SelOPC using GPT-OSS-120B and Qwen3-235B-A22-Thinking-2507. We compare pairwise GenSelect tournament with LLM-as-a-Judge (averaging 32 judgements per proof) using both binary and 7-point grading prompts. Results are averaged across 8 random seeds. No single method consistently outperforms the others; the best approach varies by dataset and model.
  • Figure 3: Left: Overview of our proposed test-time scaling method to combine GenSelect with LLM-as-a-Judge for proof selection. Given a question, we first generate $n_p$ candidate proofs by sampling from a language model. Then we run a knockout GenSelect tournament to select the best proof among the candidates using the same language model to select the top $n_s$ proofs. We estimate the correctness score of each proof using LLM-as-a-Judge by sampling $n_j$ judgements for each proof. Finally, we select the proof with the highest average correctness score as the final proof to solve the problem. Right: The tables show accuracy results for different configurations of our method on GPT-OSS-120B, averaged across 8 seeds on AIME-2025 and Challenge-19. Compute is reported as the total number of LLM calls per problem. The hybrid combination of LLM-as-a-Judge with GenSelect performs substantially better than GenSelect alone, or majority voting.
  • Figure 4: LLM-as-a-Judge performance on selecting the best proof among 8 generated proofs from SelOPC. Higher number of judgements leads to better performance, plateauing around 20-30 judgements. Furthermore, ensembling multiple models does not lead to any performance gain, and often the result does not exceed the strongest LLM, except for ensembling an 8B model trained with RL on the training set of OPC. We suspect that the improvement comes from the fact that the 8B model trained with RL is trained on the same distribution as the test set, and therefore is complementary to other LLMs.
  • Figure 5: GPT-OSS-120B LLM-as-a-Judge performance on selecting the best proof among 6 generated proofs from OPC pass@n subset. We vary the number of judgements from 1 to 45 and observe that the performance saturates at around 32 judgements.