MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving
Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, Tong Zhang
TL;DR
MA-LoT tackles the challenge of balancing natural-language planning and formal verification in Lean4 theorem proving by introducing a model-collaboration framework (Prover and Corrector) and a LoT-TL training pipeline that induces Lean4-aligned Long CoT reasoning without annotated data. The approach leverages Long CoT to generate structured proofs and uses Lean4 verifier feedback for iterative correction, achieving a 61.07% accuracy on MiniF2F-Lean4 and outperforming both tree-search and whole-proof baselines. Key contributions include the LoT-TL data pipeline, the dual-model collaboration, and empirical evidence that integrating formal verification with Long CoT yields deeper, more insightful proofs. The work points to broad applicability of Long CoT–enabled formal reasoning beyond Lean4 and suggests a general path for domain-specific reasoning without heavy annotation or RL.
Abstract
Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proof or perform tree search, but they fail to balance these tasks. We propose **MA-LoT**: *Model-CollAboration Lean-based Long Chain-of-Thought*, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this by structured interaction of the LLM and Lean4 verifier in Long CoT. To implement the framework, we propose the novel *LoT-Transfer Learning* training-inference pipeline, which enables the Long CoT thinking capability to LLMs without special data annotation. Extensive experiment shows that our framework achieves a **61.07%** accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective.
