Table of Contents
Fetching ...

STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

Kefan Dong, Tengyu Ma

TL;DR

The paper tackles data scarcity in LLM-driven formal theorem proving by introducing STP, a self-play framework where a conjecturer and a prover train in tandem, with the conjecturer generating conjectures barely provable by the current prover to create dense training signals. Through a carefully designed reward system, Wasserstein-based reweighting to maintain topic diversity, and final re-training, STP achieves substantial improvements over expert iteration, including a 28.5% cumulative proof rate on LeanWorkbook and state-of-the-art performance on miniF2F, ProofNet, and PutnamBench across multiple budgets in Lean and Isabelle. The approach demonstrates that self-generated conjectures can sustainably drive model improvement without additional external data, yielding robust gains on formal verification tasks. These results have practical implications for scalable, autonomous formal reasoning and potential broad applicability to advanced mathematical proof systems.

Abstract

A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and finetuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal versifiers. With 51.3 billion tokens generated during the training in Lean, STP proves 28.5% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%, pass@3200), Proofnet-test (23.9%, pass@3200) and PutnamBench (8/644, pass@3200). We release our code, model, and dataset in this URL: https://github.com/kfdong/STP.

STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

TL;DR

The paper tackles data scarcity in LLM-driven formal theorem proving by introducing STP, a self-play framework where a conjecturer and a prover train in tandem, with the conjecturer generating conjectures barely provable by the current prover to create dense training signals. Through a carefully designed reward system, Wasserstein-based reweighting to maintain topic diversity, and final re-training, STP achieves substantial improvements over expert iteration, including a 28.5% cumulative proof rate on LeanWorkbook and state-of-the-art performance on miniF2F, ProofNet, and PutnamBench across multiple budgets in Lean and Isabelle. The approach demonstrates that self-generated conjectures can sustainably drive model improvement without additional external data, yielding robust gains on formal verification tasks. These results have practical implications for scalable, autonomous formal reasoning and potential broad applicability to advanced mathematical proof systems.

Abstract

A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and finetuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal versifiers. With 51.3 billion tokens generated during the training in Lean, STP proves 28.5% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%, pass@3200), Proofnet-test (23.9%, pass@3200) and PutnamBench (8/644, pass@3200). We release our code, model, and dataset in this URL: https://github.com/kfdong/STP.

Paper Structure

This paper contains 51 sections, 6 equations, 5 figures, 3 tables, 4 algorithms.

Figures (5)

  • Figure 1: Self-play Theorem Prover (STP). Our model simultaneously takes on two roles --- the conjecturer that generates new, related conjecture given a seed theorem with proof (Step 1), and the prover that attempts to prove the statements in an existing dataset and the generated conjectures (Step 2). Step 4 selects the correct, approachable, elegant, yet challenging conjectures to train the conjecturer, and the verifier selects correct proofs in Step 3 to train the prover. The main difference between STP and expert iteration is the conjecturer role highlighted with a yellow background.
  • Figure 2: The cumulative pass rates of STP, expert iteration, and parallel sampling on LeanWorkbook shows that STP achieves a much better scaling in terms of the performance vs number of generated proofs. The compute for generating conjectures and training the conjecturer in STP is negligible because the number of generated proofs during training is 64 times the number of conjectures.
  • Figure 3: Comparison of pass rates on miniF2F-test (y-axis) with different numbers of inference-time samples (x-axis). The model trained with STP consistently outperforms the DeepSeek-Prover-V1.5 series.
  • Figure 4: Left: Cumulative pass rate on LeanWorkbook (translated into Isabelle) of STP, expert iteration, and parallel sampling, started from two checkpoints in STP training. STP achieves better scaling starting from both checkpoints. For better visualization, the x-axis starts with 50m in this figure, and we defer the full plot to Fig. \ref{['fig:expit-full']} (Right) in Appendix \ref{['app:isabelle-results']}. Middle: The performance of our model on miniF2F gradually improves during the training process. Note that our model is not trained on miniF2F valid and we disallow advanced tactics such as . The checkpoints are taken roughly per 68M generated proofs. Right: Histogram of empirical pass rates of generated conjectures and unproved statements in the training dataset at a checkpoint where the cumulative pass rate on LeanWorkbook (Isabelle translation) is 11.4%. The generated conjectures are significantly more likely to be proved (i.e., has a positive pass rate) than the unproved statements in the dataset, and therefore provide denser training signal. Note that the y-axis is in log scale.
  • Figure 5: Left: Comparison of pass rates between STP, two implementations of expert iteration, and parallel sampling methods on LeanWorkbook. Right: Comparison of pass rates between STP and baseline methods on LeanWorkbook (Isabelle translation). The red crosses shows the points where we refresh the self-play training as described in Section \ref{['sec:implementation']}.