Table of Contents
Fetching ...

REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong

TL;DR

REAL-Prover addresses the challenge of scaling formal mathematical reasoning to graduate-level content by introducing a retrieval-augmented, stepwise prover for Lean 4. It combines HERALD-AF for scalable formalization, LeanSearch-PS for semantic premise retrieval, and Jixia-Interactive to enable efficient training and interaction, with an expert-iteration loop for continual improvement. Evaluations on ProofNet and the new FATE-M benchmark show competitive and state-of-the-art results (23.7% Pass@64 on ProofNet; 56.7% Pass@64 on FATE-M), while MiniF2F reveals gaps on Olympiad-like problems and highlights the role of retrieval and RL in closing them. The work provides new data resources (State-Tactic Pair Dataset) and benchmarks (FATE-M), underscoring the practical potential of retrieval-augmented formal reasoning and outlining future directions toward reinforcement learning and informal reasoning layers to boost robustness and transparency.

Abstract

Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

TL;DR

REAL-Prover addresses the challenge of scaling formal mathematical reasoning to graduate-level content by introducing a retrieval-augmented, stepwise prover for Lean 4. It combines HERALD-AF for scalable formalization, LeanSearch-PS for semantic premise retrieval, and Jixia-Interactive to enable efficient training and interaction, with an expert-iteration loop for continual improvement. Evaluations on ProofNet and the new FATE-M benchmark show competitive and state-of-the-art results (23.7% Pass@64 on ProofNet; 56.7% Pass@64 on FATE-M), while MiniF2F reveals gaps on Olympiad-like problems and highlights the role of retrieval and RL in closing them. The work provides new data resources (State-Tactic Pair Dataset) and benchmarks (FATE-M), underscoring the practical potential of retrieval-augmented formal reasoning and outlining future directions toward reinforcement learning and informal reasoning layers to boost robustness and transparency.

Abstract

Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

Paper Structure

This paper contains 32 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the REAL-Prover model training pipeline. The figure illustrates two components: (a) the HERALD-AF pipeline, which translates informal mathematical statements into formal ones; (b) the expert iteration pipeline, which iteratively refines the prover model.
  • Figure 2: Overview of REAL-Prover. This figure illustrates three main processes: From (a) to (b), the prover passes several tactics to the Lean 4 interactive environment (Jixia-interactive) and receives corresponding scored state nodes. From (b) to (c), the prover selects a state node based on its score and checks whether the proof is complete. From (c) to (a), the prover using LLM with retrieval information from LeanSearch-PS generate new tactics.
  • Figure 3: Proof using LeanSearch-PS and retrieval results. The left shows part of our input prompt; the right displays the proof generated by REAL-Prover.
  • Figure 4: Compare the proofs with and without LeanSearch-PS. The prover in (a) uses the existing instance 'IsAlgClosed.instInfinite' from Mathlib, resulting in a more readable proof.
  • Figure 5: Compare the proofs with and without LeanSearch-PS. The prover in (a) uses the existing instance 'IsPGroup.of_surjective' from Mathlib, resulting in a more readable proof.