
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, Langston Nashold

Abstract

We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean 4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping off sharply for the remaining models. Beyond accuracy, we provide empirical analysis of tool use, failure modes, cost, and latency, giving a thorough evaluation of the formal theorem-proving abilities of frontier models.
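To make the task format concrete, the following is a minimal, hypothetical illustration (not an actual FormalProofBench problem) of a paired informal statement and Lean 4 formal statement; the theorem name and the use of `Mathlib` are assumptions for illustration only. The model's job is to supply a proof that the Lean 4 checker accepts.

```lean
import Mathlib

-- Informal statement (hypothetical example):
--   "The square of any real number is nonnegative."
-- The benchmark supplies the formal statement; the model must produce
-- a tactic script or proof term that the Lean 4 checker accepts.
theorem example_sq_nonneg (x : ℝ) : 0 ≤ x ^ 2 := by
  exact sq_nonneg x   -- `positivity` would also close this goal
```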

Paper Structure

This paper contains 27 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Evaluation pipeline for FormalProofBench. Models are given the informal problem statement along with the formalized statement (plus necessary headers and context) and have 40 turns, each potentially involving multiple tool calls and reasoning, to submit a proof (a minimal sketch of such a loop appears after this figure list).
  • Figure 2: System prompt used for all model evaluations.
  • Figure 3: User prompt template. Variables in braces are filled per-problem. Tool guidance is appended via {tool_guidance}; the full tool-guidance prompt is shown in a separate figure.
  • Figure 4: Total tool calls (search vs. code execution) aggregated across all 200 problems per model. Models vary widely in their tool-use strategies.
  • Figure 5: We observe a positive but modest correlation between the number of code execution calls and accuracy across models. Notable outliers include GPT 5, which achieves relatively high accuracy with very few code execution calls, and Grok 4.1 Fast, which executes substantial code but achieves low accuracy.
  • ...and 1 more figure
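For orientation, here is a minimal sketch of the kind of 40-turn agentic loop that Figure 1 describes, with search and code-execution tools as in Figure 4. All names here (call_model, run_search, run_lean, check_proof, and the fields of the reply object) are illustrative stand-ins under assumed interfaces, not the paper's actual harness.

```python
# Hypothetical sketch of the 40-turn agentic loop described in Figure 1.
# Every helper and field name below is an assumption for illustration.

MAX_TURNS = 40

SYSTEM_PROMPT = "..."  # the system prompt of Figure 2 would go here


def format_user_prompt(problem) -> str:
    # The user prompt of Figure 3: informal statement, formal statement,
    # headers/context, and the appended tool guidance.
    return (
        f"Informal problem:\n{problem.informal}\n\n"
        f"Formal statement (Lean 4):\n{problem.formal_statement}\n\n"
        f"{problem.tool_guidance}"
    )


def evaluate_problem(problem, call_model, run_search, run_lean, check_proof) -> bool:
    """Run one problem through the loop; return True iff the submitted
    Lean proof is accepted by the Lean 4 checker."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": format_user_prompt(problem)},
    ]
    for _ in range(MAX_TURNS):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply.text})

        # A single turn may involve several tool calls (search or code execution).
        for tool_call in reply.tool_calls:
            if tool_call.name == "search":
                result = run_search(tool_call.arguments["query"])
            else:  # code execution, e.g. compiling a candidate Lean snippet
                result = run_lean(tool_call.arguments["code"])
            messages.append({"role": "tool", "content": result})

        # The episode ends as soon as the model submits a final proof.
        if reply.submitted_proof is not None:
            return check_proof(problem.formal_statement, reply.submitted_proof)

    return False  # no proof submitted within the 40-turn budget
```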