Leanabell-Prover: Posttraining Scaling in Formal Reasoning
Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai
TL;DR
This work investigates posttraining scaling for formal reasoning in Lean 4 by combining large-scale continual training on hybrid statement-proof data with cognitive-behavioral data and GRPO-based reinforcement learning guided by Lean verifier rewards. It introduces Leanabell-Prover, built from SFT baselines, augmented with Lean Completion and Rewriting to instill self-reflection, and then refined via RL to optimize whole-proof generation. The approach yields state-of-the-art MiniF2F performance (59.8% pass@32) on Goedel-Prover-SFT and related baselines, demonstrating meaningful gains from data-centric and cognitive strategies, though challenges remain with weak base LLMs and RL stability. The work provides reproducible data, model checkpoints, and analysis of exploration dynamics, outlining future directions to bridge natural-language reasoning with formal proof generation. Overall, it advances the frontier of formal reasoning with LLMs, offering scalable data pipelines and RL-based improvements for Lean4 theorem proving.
Abstract
Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.
