REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong
TL;DR
REAL-Prover addresses the challenge of scaling formal mathematical reasoning to graduate-level content by introducing a retrieval-augmented, stepwise prover for Lean 4. It combines HERALD-AF for scalable formalization, LeanSearch-PS for semantic premise retrieval, and Jixia-Interactive to enable efficient training and interaction, with an expert-iteration loop for continual improvement. Evaluations on ProofNet and the new FATE-M benchmark show competitive and state-of-the-art results (23.7% Pass@64 on ProofNet; 56.7% Pass@64 on FATE-M), while MiniF2F reveals gaps on Olympiad-like problems and highlights the role of retrieval and RL in closing them. The work provides new data resources (State-Tactic Pair Dataset) and benchmarks (FATE-M), underscoring the practical potential of retrieval-augmented formal reasoning and outlining future directions toward reinforcement learning and informal reasoning layers to boost robustness and transparency.
Abstract
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
