Table of Contents
Fetching ...

Formal Theorem Proving by Rewarding LLMs to Decompose Proofs Hierarchically

Kefan Dong, Arvind Mahankali, Tengyu Ma

TL;DR

This paper designs an RL-based training algorithm that encourages the model to decompose a theorem into lemmas, prove the lemmas, and then prove the theorem by using the lemmas, inspired by how mathematicians train themselves.

Abstract

Mathematical theorem proving is an important testbed for large language models' deep and abstract reasoning capability. This paper focuses on improving LLMs' ability to write proofs in formal languages that permit automated proof verification/evaluation. Most previous results provide human-written lemmas to the theorem prover, which is an arguably oversimplified setting that does not sufficiently test the provers' planning and decomposition capabilities. Instead, we work in a more natural setup where the lemmas that are directly relevant to the theorem are not given to the theorem prover at test time. We design an RL-based training algorithm that encourages the model to decompose a theorem into lemmas, prove the lemmas, and then prove the theorem by using the lemmas. Our reward mechanism is inspired by how mathematicians train themselves: even if a theorem is too challenging to be proved by the current model, a positive reward is still given to the model for any correct and novel lemmas that are proposed and proved in this process. During training, our model proposes and proves lemmas that are not in the training dataset. In fact, these newly-proposed correct lemmas consist of 37.7% of the training replay buffer when we train on the dataset extracted from Archive of Formal Proofs (AFP). The model trained by our RL algorithm outperforms that trained by supervised finetuning, improving the pass rate from 40.8% to 45.5% on AFP test set, and from 36.5% to 39.5% on an out-of-distribution test set.

Formal Theorem Proving by Rewarding LLMs to Decompose Proofs Hierarchically

TL;DR

This paper designs an RL-based training algorithm that encourages the model to decompose a theorem into lemmas, prove the lemmas, and then prove the theorem by using the lemmas, inspired by how mathematicians train themselves.

Abstract

Mathematical theorem proving is an important testbed for large language models' deep and abstract reasoning capability. This paper focuses on improving LLMs' ability to write proofs in formal languages that permit automated proof verification/evaluation. Most previous results provide human-written lemmas to the theorem prover, which is an arguably oversimplified setting that does not sufficiently test the provers' planning and decomposition capabilities. Instead, we work in a more natural setup where the lemmas that are directly relevant to the theorem are not given to the theorem prover at test time. We design an RL-based training algorithm that encourages the model to decompose a theorem into lemmas, prove the lemmas, and then prove the theorem by using the lemmas. Our reward mechanism is inspired by how mathematicians train themselves: even if a theorem is too challenging to be proved by the current model, a positive reward is still given to the model for any correct and novel lemmas that are proposed and proved in this process. During training, our model proposes and proves lemmas that are not in the training dataset. In fact, these newly-proposed correct lemmas consist of 37.7% of the training replay buffer when we train on the dataset extracted from Archive of Formal Proofs (AFP). The model trained by our RL algorithm outperforms that trained by supervised finetuning, improving the pass rate from 40.8% to 45.5% on AFP test set, and from 36.5% to 39.5% on an out-of-distribution test set.

Paper Structure

This paper contains 42 sections, 2 equations, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: Left: in the dashed callout block, we show an example of an Isabelle proof and its explanation in natural language. Right: an example of a proof tree. The two child nodes correspond to the two new lemmas proposed in the proof of the root node.
  • Figure 2: Illustration of our algorithm Proof Decomposer, ProD-RL. In step 2b, the statement is locally correct if it is proved correctly using the proposed lemmas, and it is globally correct if all the proposed lemmas are also proved correctly. As an important feature of our algorithm, even if a theorem (Theorem 1) is not proved by the model because some lemmas (Lemmas 1) are not proved, we still train on the correct lemma (Lemma 2) by setting its reward $r=1$.
  • Figure 3: The pass rate of different models on AFP test (Left) and AFP 2023 (Right) test sets. Our RL model improves upon the SFT model whereas without proposing new lemmas (RL w/o lemma proposal), we do not observe any improvement.
  • Figure 4: The pass rate of theorems in AFP test grouped by the depth of their ground-truth proof. Grey bars represent the proof generated by the model SFT w/o lemma proposal, and the colored bars represent the proof trees generated by ProD-RL with various depths.
  • Figure 5: Pass rate of the SFT model without lemma proposal tested with different sampling temperature. We observe that lower temperature leads to better performance with 1 sample per theorem, and mildly larger temperature have better performance with more samples.
  • ...and 1 more figures