Table of Contents
Fetching ...

Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning

Xingguang Ji, Yahui Liu, Qi Wang, Jingyuan Zhang, Yang Yue, Rui Shi, Chenxi Sun, Fuzheng Zhang, Guorui Zhou, Kun Gai

TL;DR

This work tackles the challenge of enabling small LLMs to generate formal proofs by tightly integrating a Lean 4 verifier into a multi-turn reasoning loop. The authors introduce Leanabell-Prover-V2, a 7B model that employs verifier-guided long chain-of-thought and reinforcement learning (via DAPO) to refine proof steps, with feedback from the Lean 4 verifier guiding rewrites and policy updates. Through cold-start data design, verifier-aware RL, and data from external formalization tools, the approach achieves state-of-the-art-like performance on MiniF2F among 7B models and shows notable gains on ProofNet and ProverBench benchmarks, including AIME-style problems. The results demonstrate improved sample efficiency and the model’s capacity to reflect on and correct its reasoning in the presence of verifier feedback, suggesting practical pathways for scalable, reliable formal reasoning in smaller LLMs.

Abstract

We introduce our Leanabell-Prover-V2, a 7B large language models (LLMs) that can produce formal theorem proofs in Lean 4, with verifier-integrated Long Chain-of-Thoughts (CoT). Following our previous work Leanabell-Prover-V1, we continual to choose to posttrain existing strong prover models for further performance improvement. In our V2 version, we mainly upgrade the Reinforcement Learning (RL) with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become ``self-aware'' of the correctness of its own reasoning process and learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source codes, curated data and models are available at: https://github.com/Leanabell-LM/Leanabell-Prover-V2.

Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning

TL;DR

This work tackles the challenge of enabling small LLMs to generate formal proofs by tightly integrating a Lean 4 verifier into a multi-turn reasoning loop. The authors introduce Leanabell-Prover-V2, a 7B model that employs verifier-guided long chain-of-thought and reinforcement learning (via DAPO) to refine proof steps, with feedback from the Lean 4 verifier guiding rewrites and policy updates. Through cold-start data design, verifier-aware RL, and data from external formalization tools, the approach achieves state-of-the-art-like performance on MiniF2F among 7B models and shows notable gains on ProofNet and ProverBench benchmarks, including AIME-style problems. The results demonstrate improved sample efficiency and the model’s capacity to reflect on and correct its reasoning in the presence of verifier feedback, suggesting practical pathways for scalable, reliable formal reasoning in smaller LLMs.

Abstract

We introduce our Leanabell-Prover-V2, a 7B large language models (LLMs) that can produce formal theorem proofs in Lean 4, with verifier-integrated Long Chain-of-Thoughts (CoT). Following our previous work Leanabell-Prover-V1, we continual to choose to posttrain existing strong prover models for further performance improvement. In our V2 version, we mainly upgrade the Reinforcement Learning (RL) with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become ``self-aware'' of the correctness of its own reasoning process and learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source codes, curated data and models are available at: https://github.com/Leanabell-LM/Leanabell-Prover-V2.

Paper Structure

This paper contains 14 sections, 5 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Benchmark performance on MiniF2F-test zheng2021minif2f. Our method boosts both the two baseline models after employing RL training. Our framework surpasses Kimina-Prover-Preview-Distill-7B and DeepSeek-Prover-V2-7B 3.2% and 2.0% at pass@128, respectively.
  • Figure 2: Prompt for verifier-integrated theorem proving, where we encourage the model employ Lean 4 verifier to verify the correctness of the generated proofs.
  • Figure 3: Overview of four scenarios to cold start data synthesis.
  • Figure 4: Overview of our proposed verifier-integrated theorem proving framework.
  • Figure 5: Several core records in the RL training with DeepSeek-Prover-V2-7B model, including Response Length, Reward Score, Verifier Use Rate, Solve All Ratio, Test Score avg@32, and Entropy.
  • ...and 8 more figures