Do We Need Frontier Models to Verify Mathematical Proofs?

Aaditya Naik, Guruprerana Shabadi, Rajeev Alur, Mayur Naik

Abstract

Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but they are up to ~25% more inconsistent. Furthermore, we see that verifier accuracy is sensitive to prompt choice across all models. We then demonstrate that the smaller models, in fact, do possess the mathematical capabilities to verify proofs at the level of frontier models, but they struggle to reliably elicit these capabilities with general judging prompts. Through an LLM-guided prompt search, we synthesize an ensemble of specialized prompts that overcome the specific failure modes of smaller models, boosting their performance by up to 9.1% in accuracy and 15.9% in self-consistency. These gains are realized across models and datasets, allowing models like Qwen3.5-35B to perform on par with frontier models such as Gemini 3.1 Pro for proof verification.
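As a concrete illustration of the two metrics named above, the sketch below computes balanced accuracy and self-consistency from repeated binary verdicts (1 = proof judged correct, 0 = judged flawed). The function names and the exact definition of self-consistency used here (majority-agreement rate per proof, averaged over proofs) are assumptions made for illustration; the paper's precise formulas may differ.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    # Mean of the accuracy on correct proofs and the accuracy on flawed proofs,
    # so an imbalanced mix of correct/flawed proofs does not skew the score.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # true-positive rate on correct proofs
    tnr = np.mean(y_pred[y_true == 0] == 0)  # true-negative rate on flawed proofs
    return (tpr + tnr) / 2

def self_consistency(verdicts_per_proof):
    # For each proof, the fraction of repeated judgments that agree with the
    # majority verdict for that proof, averaged over all proofs.
    # (One plausible reading of "rate of agreement across repeated judgments".)
    rates = []
    for runs in verdicts_per_proof:  # runs: list of 0/1 verdicts for one proof
        runs = np.asarray(runs)
        rates.append(max(runs.mean(), 1 - runs.mean()))
    return float(np.mean(rates))
```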

Figures (11)

  • Figure 1: (Left) Mean balanced accuracy and self-consistency rates of frontier and open-source LLM judges across all datasets and prompts. Low- and high-reasoning settings are denoted with '•' and '⧫', respectively, while '★' denotes performance after prompt ensembling. Smaller open-source models are only ~10% behind frontier models in balanced accuracy, but are substantially less self-consistent. Prompt ensembling closes much of this gap, allowing models such as Qwen3.5-35B to approach frontier-level verifier performance. (Right) Overview of our prompt-ensemble method. A proof is evaluated under a collection of prompts with complementary roles, including general grading prompts and prompts specialized to different aspects of proof correctness. Their judgments are aggregated by threshold voting to produce a final verdict (a minimal sketch of this aggregation appears after the figure list).
  • Figure 2: Comparing mean balanced accuracy of all the models on the combined dataset using the three prompts: IMOBench, GIMO, ProofBench7pt. The mean is computed over three independent runs and the error bars indicate the maximum and minimum balanced accuracies amongst the runs.
  • Figure 3: False positive rates (left) and false negative rates (right) of a few models on the combined dataset, compared across the three prompts.
  • Figure 4: Two failure modes of GPT-OSS-120B with the IMOBench prompt on IMO-GradingBench, and how targeted prompts address them. Each column shows the proof excerpt (top), the IMOBench prompt's reasoning (middle), and a targeted prompt's response (bottom). Excerpts are marked by a thick left bar; red highlights mark errors, and yellow highlights mark corrected judgments. Case 1: the IMOBench prompt accepts the proof's invalid inequality combination without investigating deeper, while the logic-chain-verifier prompt catches the error. Case 2: the IMOBench prompt invents a core missing justification rather than flagging the reasoning gap as an error, while the proof-repair prompt identifies the incomplete case analysis.
  • Figure 5: Performance of the prompt ensemble on the three datasets.
  • ...and 6 more figures
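The threshold-voting aggregation described in the Figure 1 caption can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the prompt names, the `judge_proof` interface, and the default threshold are assumptions made here for concreteness.

```python
def threshold_vote(verdicts, threshold=0.5):
    # Accept the proof only if at least `threshold` fraction of the ensemble's
    # prompts return an accepting verdict (1 = accept, 0 = reject).
    return int(sum(verdicts) / len(verdicts) >= threshold)

# Hypothetical prompt roles, loosely mirroring those named in Figure 4.
PROMPTS = [
    "general_grader",        # general grading prompt
    "logic_chain_verifier",  # re-checks each inequality / implication step
    "proof_repair",          # looks for gaps that a repair attempt would expose
]

def ensemble_verdict(problem, proof, judge_proof, threshold=0.5):
    # `judge_proof(prompt_name, problem, proof) -> 0 or 1` is a hypothetical
    # interface standing in for one LLM-judge call per specialized prompt.
    verdicts = [judge_proof(p, problem, proof) for p in PROMPTS]
    return threshold_vote(verdicts, threshold)
```

One plausible rationale, consistent with the abstract's findings, is that aggregating prompts targeting different failure modes dampens the run-to-run inconsistency of any single prompt on a smaller model.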