Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection
Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, Wei Du, Ivan Moshkov, George Armstrong, Renjie Liao, Christos Thrampoulidis, Igor Gitman
TL;DR
This work addresses the challenge of validating mathematical proofs produced by large language models beyond mere final answers. It contrasts two verification paradigms—single-proof judgments by LLMs and tournament-based proof selection (GenSelect)—and introduces a scalable hybrid framework that combines both for improved accuracy and compute efficiency. Through a rigorous evaluation framework across diverse datasets, the authors show that prompt design heavily influences performance, but reinforcement learning can reduce this sensitivity, particularly for proof judgments. The study provides practical guidelines for designing scalable proof-verification and selection systems, revealing that while proof-level metrics can improve, final-answer precision does not necessarily follow, underscoring the need for careful evaluation and potential human oversight in high-stakes mathematics tasks.
Abstract
Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. However, despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current models often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.
