Table of Contents
Fetching ...

The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev

TL;DR

The Open Proof Corpus addresses a critical gap in evaluating LLM-generated mathematical proofs by providing a large, human-validated dataset of over 5,000 proofs across more than 1,000 problems, sourced from high-quality competitions. It demonstrates that NL proofs currently outperform formal proofs, reveals a sizable gap between final-answer accuracy and proof validity, and shows that best-of-n and ranking-based strategies can substantially improve proof quality. The study also shows LLMs can judge proofs at near-human levels and reports an open 8B model achieving comparable judging accuracy to the best closed models. Together, these results establish OPC as a practical, scalable resource for training, evaluation, and benchmarking of proof-generation systems and point to directions for bridging the formal-natural reasoning gap.

Abstract

In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.

The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

TL;DR

The Open Proof Corpus addresses a critical gap in evaluating LLM-generated mathematical proofs by providing a large, human-validated dataset of over 5,000 proofs across more than 1,000 problems, sourced from high-quality competitions. It demonstrates that NL proofs currently outperform formal proofs, reveals a sizable gap between final-answer accuracy and proof validity, and shows that best-of-n and ranking-based strategies can substantially improve proof quality. The study also shows LLMs can judge proofs at near-human levels and reports an open 8B model achieving comparable judging accuracy to the best closed models. Together, these results establish OPC as a practical, scalable resource for training, evaluation, and benchmarking of proof-generation systems and point to directions for bridging the formal-natural reasoning gap.

Abstract

In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.

Paper Structure

This paper contains 60 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of the OPC and its conclusions. On the left, we show a typical sample from the OPC, including a question taken from a high-quality mathematical competition, a proof generated by an LLM, and a human judgment of the proof's correctness. On the right, we summarize the main conclusions drawn from the OPC.
  • Figure 2: Overview of competitions included in the OPC.
  • Figure 3: Average proof correctness of various models on the OPC. Data is split into two partitions, the first, resp. second, containing only problems answered by all models except DeepSeek-R1, resp. Grok-3-Mini. The discrepancy between the two partitions is due to the inclusion of harder competitions in the second partition.
  • Figure 4: Average proof correctness on the PutnamBench.
  • Figure 5: Comparison of final-answer accuracy and proof correctness on the MathArena subset.
  • ...and 5 more figures