Table of Contents
Fetching ...

Lean-STaR: Learning to Interleave Thinking and Proving

Haohan Lin, Zhiqing Sun, Sean Welleck, Yiming Yang

TL;DR

Lean-STaR addresses the gap between informal reasoning and formal Lean proofs by injecting natural-language thoughts before each tactic. It generates retrospective thoughts from human proofs, trains a thought-augmented predictor, and further bootstraps it via expert iteration. On the miniF2F benchmark, Lean-STaR establishes new state-of-the-art results, outperforming prior methods by leveraging thought augmentation and self-improvement. This work demonstrates a scalable approach to enhance formal theorem proving and deepens the integration of informal and formal mathematics.

Abstract

Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment, significantly outperforming base models ($\boldsymbol{43.4\% \rightarrow 46.3\%,}$ Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.

Lean-STaR: Learning to Interleave Thinking and Proving

TL;DR

Lean-STaR addresses the gap between informal reasoning and formal Lean proofs by injecting natural-language thoughts before each tactic. It generates retrospective thoughts from human proofs, trains a thought-augmented predictor, and further bootstraps it via expert iteration. On the miniF2F benchmark, Lean-STaR establishes new state-of-the-art results, outperforming prior methods by leveraging thought augmentation and self-improvement. This work demonstrates a scalable approach to enhance formal theorem proving and deepens the integration of informal and formal mathematics.

Abstract

Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment, significantly outperforming base models ( Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.
Paper Structure (34 sections, 2 equations, 6 figures, 10 tables)

This paper contains 34 sections, 2 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: The illustration of tactic prediction in one proof step with and without thought.
  • Figure 2: An example of Lean proof and thoughts generated by Lean-STaR. Note that there is a calculation error in the thought (in red), but this does not affect the correctness of the proof because the calculation task is actually completed by the interactive theorem prover (i.e., Lean's nlinarith) instead of the language model. This shows a benefit of combining neural and symbolic systems.
  • Figure 3: The diagram of our pipeline. (1) Produce CoT dataset through GPT-4. (2) Fine-tune the SFT model with the CoT dataset to obtain Lean-CoT. (3) Use expert iteration to generate the STaR dataset through the model in the last iteration (Lean-CoT in the first iteration) and then fine-tune Lean-CoT on the updated STaR dataset to obtain the model in the next iteration. We continue performing this step until a stopping condition is met (e.g., a fixed number of iterations).
  • Figure 4: The visualization of Best-first Search ($K=1$) and Sampling ($S=1$). Search method maintains a search tree and explores $S$ tactics on each expanded node. Sampling method explores $K$ tactic trajectories from the root and ignores illegal tactics in the trajectories.
  • Figure 5: A example proof and its visualization of $n\leq m \Rightarrow n+k \leq m+k$ in Lean, taken from lample2022hypertree. The induction tactic reduces the initial statement to two subgoals. Then tactics case 0:exact $h_0$ and case ih:rw nat.succ_le_succ_iff, case ih:exact k_ih can be applied in turn to complete the proof.
  • ...and 1 more figures