Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

Xinyuan Zhou; Yi Lei; Xiaoyu Zhou; Jingyi Sun; Yu Zhu; Zhongyi Ye; Weitai Zhang; Quan Liu; Si Wei; Cong Liu

Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu, Zhongyi Ye, Weitai Zhang, Quan Liu, Si Wei, Cong Liu

TL;DR

Spark-Prover-X1-7B introduces a three-stage training pipeline (continuous pre-training, supervised fine-tuning with expert iteration, and targeted reinforcement learning) to enhance formal theorem proving in a lightweight 7B model. It pairs a formalizer with a prover and leverages novel data augmentation, CoT-augmented state prediction, and sub-goal decomposition to broaden reasoning coverage. The approach achieves state-of-the-art performance among open-source, similarly sized models across multiple benchmarks, including ExamFormal-Bench and PutnamBench, and demonstrates strong capabilities in auto-formalization and formal proof generation. By releasing the models and ExamFormal-Bench, the work provides a practical path for boosting formal reasoning in lightweight LLMs through diverse training data and staged refinement.

Abstract

Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. We will release both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, in the near future.

Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

TL;DR

Abstract

Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (4)