CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization
Shuming Shi, Ruobing Zuo, Gaolei He, Jianlin Wang, Chenyang Xu, Zhengfeng Yang
TL;DR
This work addresses the mismatch between LLM-based automated theorem proving and human preferences by introducing CuDIP, a curriculum learning-based Direct Preference Optimization framework. CuDIP automatically constructs high-quality, diverse preference data using fine-grained scoring by LLMs and existing theorem-proving data, reducing reliance on manual annotations. It then applies curriculum-based iterative DPO to progressively refine a prover, yielding notable improvements on benchmarks such as MiniF2F and ProofNet (e.g., pass@1 up to 38.5% on MiniF2F-valid). The approach advances automated theorem proving by aligning optimization with human-like preferences while leveraging curriculum structure to steadily tackle harder problems, with strong empirical gains across multiple model configurations.
Abstract
Automated theorem proving (ATP) is one of the most challenging mathematical reasoning tasks for Large Language Models (LLMs). Most existing LLM-based ATP methods rely on supervised fine-tuning, which results in a limited alignment between the theorem proving process and human preferences. Direct Preference Optimization (DPO), which aligns LLMs with human preferences, has shown positive effects for certain tasks. However, the lack of high-quality preference data for theorem proving presents a significant challenge. In this paper, we innovatively apply DPO to formal automated theorem proving and introduces a Curriculum Learning-based DPO Iterative Theorem Proving (CuDIP) method. Specifically, we propose a method for constructing preference data which utilizes LLMs and existing theorem proving data to enhance the diversity of the preference data while reducing the reliance on human preference annotations. We then integrate this preference data construction method with curriculum learning to iteratively fine-tune the theorem proving model through DPO. Experimental results on the MiniF2F and ProofNet datasets demonstrate the effectiveness of the proposed method.
