Table of Contents
Fetching ...

LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction

Robert Joseph George, Suozhi Huang, Peiyang Song, Anima Anandkumar

TL;DR

LeanProgress addresses the challenge of guiding search in formal proof development by predicting the number of remaining proof steps from a given Lean proof state. It builds a balanced dataset from Lean Workbook Plus and Mathlib4, then fine-tunes a lightweight model (DeepSeek Coder 1.3B) to predict remaining steps, achieving an overall MAE of 3.15 and accuracy of 75.8% on held-out data when using proof history. Incorporating this progress signal into a best-first search framework with Reprover yields a 3.8% improvement on Mathlib4-pass rates, demonstrating that global progress signals can enhance both automated and interactive theorem proving. The work also integrates a predict_steps_with_suggestion tool into LeanCopilot, offering users immediate feedback on proof progress alongside tactic suggestions, and is released with open-source code for community use and extension.

Abstract

Mathematical reasoning remains a significant challenge for Large Language Models (LLMs) due to hallucinations. When combined with formal proof assistants like Lean, these hallucinations can be eliminated through rigorous verification, making theorem proving reliable. However, even with formal verification, LLMs still struggle with long proofs and complex mathematical formalizations. While Lean with LLMs offers valuable assistance with retrieving lemmas, generating tactics, or even complete proofs, it lacks a crucial capability: providing a sense of proof progress. This limitation particularly impacts the overall development efficiency in large formalization projects. We introduce LeanProgress, a method that predicts the progress in the proof. Training and evaluating our models made on a large corpus of Lean proofs from Lean Workbook Plus and Mathlib4 and how many steps remain to complete it, we employ data preprocessing and balancing techniques to handle the skewed distribution of proof lengths. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.8% in predicting the amount of progress and, hence, the remaining number of steps. When integrated into a best-first search framework using Reprover, our method shows a 3.8% improvement on Mathlib4 compared to baseline performances of 41.4%, particularly for longer proofs. These results demonstrate how proof progress prediction can enhance both automated and interactive theorem proving, enabling users to make more informed decisions about proof strategies. Our code is merged in this library here https://github.com/lean-dojo/LeanDojo-v2.

LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction

TL;DR

LeanProgress addresses the challenge of guiding search in formal proof development by predicting the number of remaining proof steps from a given Lean proof state. It builds a balanced dataset from Lean Workbook Plus and Mathlib4, then fine-tunes a lightweight model (DeepSeek Coder 1.3B) to predict remaining steps, achieving an overall MAE of 3.15 and accuracy of 75.8% on held-out data when using proof history. Incorporating this progress signal into a best-first search framework with Reprover yields a 3.8% improvement on Mathlib4-pass rates, demonstrating that global progress signals can enhance both automated and interactive theorem proving. The work also integrates a predict_steps_with_suggestion tool into LeanCopilot, offering users immediate feedback on proof progress alongside tactic suggestions, and is released with open-source code for community use and extension.

Abstract

Mathematical reasoning remains a significant challenge for Large Language Models (LLMs) due to hallucinations. When combined with formal proof assistants like Lean, these hallucinations can be eliminated through rigorous verification, making theorem proving reliable. However, even with formal verification, LLMs still struggle with long proofs and complex mathematical formalizations. While Lean with LLMs offers valuable assistance with retrieving lemmas, generating tactics, or even complete proofs, it lacks a crucial capability: providing a sense of proof progress. This limitation particularly impacts the overall development efficiency in large formalization projects. We introduce LeanProgress, a method that predicts the progress in the proof. Training and evaluating our models made on a large corpus of Lean proofs from Lean Workbook Plus and Mathlib4 and how many steps remain to complete it, we employ data preprocessing and balancing techniques to handle the skewed distribution of proof lengths. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.8% in predicting the amount of progress and, hence, the remaining number of steps. When integrated into a best-first search framework using Reprover, our method shows a 3.8% improvement on Mathlib4 compared to baseline performances of 41.4%, particularly for longer proofs. These results demonstrate how proof progress prediction can enhance both automated and interactive theorem proving, enabling users to make more informed decisions about proof strategies. Our code is merged in this library here https://github.com/lean-dojo/LeanDojo-v2.

Paper Structure

This paper contains 23 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The visualization of LeanProgress. LeanProgress is a lightweight framework that collects the number of remaining steps in proof trees and then balances the data distribution to train the language model. Then LeanProgress takes the proof state as input to generate the remaining steps for each state as a signal for search. LeanProgress also integrates the tactic predict_steps in LeanCopilot as a user-friendly tool.
  • Figure 2: The visualization of extract proof tree in theorem proving.
  • Figure 3: Proof‐length distributions before and after length balancing. Panel (a) plots all collected states prior to balancing (dominated by 1–2-step proofs; average length $L_{\text{original}}=2.47$). Panel (b) shows the training set after balancing, where we down-sample short proofs and (lightly) up-sample longer ones, yielding fewer total states and a higher average length $L_{\text{balanced}}=10.1$. Bar heights are raw counts, so the totals differ and the $y$-axes are not directly comparable.
  • Figure 4: Qualitative Example 1/6 of running our Step Predictor on real-world Lean problems.
  • Figure 5: Qualitative Example 2/6 of running our Step Predictor on real-world Lean problems.
  • ...and 4 more figures