Table of Contents
Fetching ...

Leanabell-Prover: Posttraining Scaling in Formal Reasoning

Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai

TL;DR

This work investigates posttraining scaling for formal reasoning in Lean 4 by combining large-scale continual training on hybrid statement-proof data with cognitive-behavioral data and GRPO-based reinforcement learning guided by Lean verifier rewards. It introduces Leanabell-Prover, built from SFT baselines, augmented with Lean Completion and Rewriting to instill self-reflection, and then refined via RL to optimize whole-proof generation. The approach yields state-of-the-art MiniF2F performance (59.8% pass@32) on Goedel-Prover-SFT and related baselines, demonstrating meaningful gains from data-centric and cognitive strategies, though challenges remain with weak base LLMs and RL stability. The work provides reproducible data, model checkpoints, and analysis of exploration dynamics, outlining future directions to bridge natural-language reasoning with formal proof generation. Overall, it advances the frontier of formal reasoning with LLMs, offering scalable data pipelines and RL-based improvements for Lean4 theorem proving.

Abstract

Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.

Leanabell-Prover: Posttraining Scaling in Formal Reasoning

TL;DR

This work investigates posttraining scaling for formal reasoning in Lean 4 by combining large-scale continual training on hybrid statement-proof data with cognitive-behavioral data and GRPO-based reinforcement learning guided by Lean verifier rewards. It introduces Leanabell-Prover, built from SFT baselines, augmented with Lean Completion and Rewriting to instill self-reflection, and then refined via RL to optimize whole-proof generation. The approach yields state-of-the-art MiniF2F performance (59.8% pass@32) on Goedel-Prover-SFT and related baselines, demonstrating meaningful gains from data-centric and cognitive strategies, though challenges remain with weak base LLMs and RL stability. The work provides reproducible data, model checkpoints, and analysis of exploration dynamics, outlining future directions to bridge natural-language reasoning with formal proof generation. Overall, it advances the frontier of formal reasoning with LLMs, offering scalable data pipelines and RL-based improvements for Lean4 theorem proving.

Abstract

Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Benchmark performance on MiniF2F-test zheng2021minif2f. Our method boosts both the two baseline models after employing RL training. Goedel-Prover-RL is our implementation. Our framework surpasses DeepSeek-Prover-v1.5-RL and Goedel-Prover-SFT 6.6% and 2.2%, respectively.
  • Figure 2: Distributions of math domains in various Lean 4 dataset. Lean Workbook, Goedel-Prover, STP Lean and NuminaMath are training set. MiniF2F and ProofNet are test set.
  • Figure 3: Exploration ability: pass@16 measures how well base models explore.
  • Figure 4: Left: Reward curve during training Leanabell-Prover-Prover-DS-RL. Right: Reward curve during training Leanabell-Prover-Prover-GD-RL.
  • Figure 5: Distribution of problem types that failed verification on the MiniF2F-test set.
  • ...and 1 more figures