Table of Contents
Fetching ...

Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu, Zhongyi Ye, Weitai Zhang, Quan Liu, Si Wei, Cong Liu

TL;DR

Spark-Prover-X1-7B introduces a three-stage training pipeline (continuous pre-training, supervised fine-tuning with expert iteration, and targeted reinforcement learning) to enhance formal theorem proving in a lightweight 7B model. It pairs a formalizer with a prover and leverages novel data augmentation, CoT-augmented state prediction, and sub-goal decomposition to broaden reasoning coverage. The approach achieves state-of-the-art performance among open-source, similarly sized models across multiple benchmarks, including ExamFormal-Bench and PutnamBench, and demonstrates strong capabilities in auto-formalization and formal proof generation. By releasing the models and ExamFormal-Bench, the work provides a practical path for boosting formal reasoning in lightweight LLMs through diverse training data and staged refinement.

Abstract

Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. We will release both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, in the near future.

Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

TL;DR

Spark-Prover-X1-7B introduces a three-stage training pipeline (continuous pre-training, supervised fine-tuning with expert iteration, and targeted reinforcement learning) to enhance formal theorem proving in a lightweight 7B model. It pairs a formalizer with a prover and leverages novel data augmentation, CoT-augmented state prediction, and sub-goal decomposition to broaden reasoning coverage. The approach achieves state-of-the-art performance among open-source, similarly sized models across multiple benchmarks, including ExamFormal-Bench and PutnamBench, and demonstrates strong capabilities in auto-formalization and formal proof generation. By releasing the models and ExamFormal-Bench, the work provides a practical path for boosting formal reasoning in lightweight LLMs through diverse training data and staged refinement.

Abstract

Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. We will release both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, in the near future.

Paper Structure

This paper contains 40 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Performance comparison of formalizer and prover models on different datasets.
  • Figure 2: Overview of pretrain process in Spark-Prover-X1-7B. Process (a) is the collection of open-source data using multiple formal language towards multiple tasks. Process (b) is the fine-grained data augmentation by state prediction with CoT. Process (c) is data synthesis orient auto-formalization and formal thoerem proving, where "NLS" stands for natural language statement, "FLS" means formal language statement, "NLP" is natural language proof, and "FLP" is formal language proof. Process (d) is the subgoal decomposition with verification for the unsolved problems.
  • Figure 3: Training framework.
  • Figure 4: Evolution of output length distribution.