LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Jianing Wang; Jianfei Zhang; Qi Guo; Linsen Guo; Rumei Li; Chao Zhang; Chong Peng; Cunguang Wang; Dengchang Zhao; Jiarong Shi; Jingang Wang; Liulin Feng; Mengxia Shen; Qi Li; Shengnan An; Shun Wang; Wei Shi; Xiangyu Xi; Xiaoyu Li; Xuezhi Cao; Yi Lu; Yunke Zhao; Zhengyu Chen; Zhimin Lin; Wei Wang; Peng Pei; Xunliang Cai

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai

Abstract

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Abstract

Paper Structure (40 sections, 3 equations, 5 figures, 5 tables)

This paper contains 40 sections, 3 equations, 5 figures, 5 tables.

Introduction
Hybrid-Experts Iteration Framework
Native Formal Experts
Auto-Formalization (AF)
Sketching
Proving
Hybrid-experts Tool-Integration Synthesize
Experts Self-Evolving
Approach
Overview
Data Curation
Basic Processing
Difficulty Estimation
Diversity Sampling
Hierarchical Importance Sampling Policy Optimization
...and 25 more sections

Figures (5)

Figure 1: The performance comparison over proving tasks. The figures on the left and middle illustrate the performance with a limited 32 inference budget on MathOlympiad-Bench and PutnamBench, respectively. The figure on the right shows the model performance on MiniF2F-Test balancing with the inference budgets.
Figure 2: The overview of our hybrid-experts tool-integration synthesis pipeline. In this pipeline, we iteratively optimize three different experts (i.e., auto-formalizer, lemma-style sketcher, and prover), and use these experts to synthesize trajectories based on pre-defined formal reasoning capabilities. Given one problem in natural language, we first transform it into a formal statement in Lean4 by interacting with the Lean4 compiler. Then, the formal statement will be used to generate a whole-proof and lemma-style sketch. If the whole-proof still fails to pass verification within a limited number of tool feedback rounds, the proof generated from the lemma-style sketch will be used instead for verification. In the rejection sampling phase, we will retain the trajectory that allows us to fully utilize the tools to perform auto-formalization, sketching, and proving.
Figure 3: The overview of the training pipeline of LongCat-Flash-Prover. We choose the LongCat Mid-train base model as the starting point, and then perform domain-mixed SFT to obtain a cold-start model. Then, we use the new cold-start model to refresh the prompts and trajectories through self-distillation and agentic RL, and repeat this process multiple times. After the iteration, we obtain our LongCat-Flash-Prover model.
Figure 4: RL rollout pass rate (w/ v.s. w/o hacking).
Figure 5: The workflow of agentic lemma tree search.

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Abstract

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Authors

Abstract

Table of Contents

Figures (5)