REFACTOR: Learning to Extract Theorems from Proofs

Jin Peng Zhou; Yuhuai Wu; Qiyang Li; Roger Grosse

REFACTOR: Learning to Extract Theorems from Proofs

Jin Peng Zhou, Yuhuai Wu, Qiyang Li, Roger Grosse

TL;DR

This work introduces REFACTOR, a neural method that learns to extract reusable theorems from formal proofs by expanding human proofs and training a graph-based model to identify embedded theorems. The approach yields unseen-theorem extraction accuracy of $19.6\%$, discovers hundreds of new theorems, and, when integrated into the Metamath library, enables significant proof refactoring and compression. A neural prover trained on the refactored dataset proves more test theorems (e.g., $+75$) and relies on a diverse set of newly extracted theorems (average usage $31.0\%$ of proved theorems). Overall, REFACTOR demonstrates the viability of data-driven theorem extraction to improve both library quality and automated proving performance, with broad potential for extending to other formal systems and program synthesis.

Abstract

Human mathematicians are often good at recognizing modular and reusable theorems that make complex mathematical results within reach. In this paper, we propose a novel method called theoREm-from-prooF extrACTOR (REFACTOR) for training neural networks to mimic this ability in formal mathematical theorem proving. We show on a set of unseen proofs, REFACTOR is able to extract 19.6% of the theorems that humans would use to write the proofs. When applying the model to the existing Metamath library, REFACTOR extracted 16 new theorems. With newly extracted theorems, we show that the existing proofs in the MetaMath database can be refactored. The new theorems are used very frequently after refactoring, with an average usage of 733.5 times, and help shorten the proof lengths. Lastly, we demonstrate that the prover trained on the new-theorem refactored dataset proves more test theorems and outperforms state-of-the-art baselines by frequently leveraging a diverse set of newly extracted theorems. Code can be found at https://github.com/jinpz/refactor.

REFACTOR: Learning to Extract Theorems from Proofs

TL;DR

, discovers hundreds of new theorems, and, when integrated into the Metamath library, enables significant proof refactoring and compression. A neural prover trained on the refactored dataset proves more test theorems (e.g.,

) and relies on a diverse set of newly extracted theorems (average usage

of proved theorems). Overall, REFACTOR demonstrates the viability of data-driven theorem extraction to improve both library quality and automated proving performance, with broad potential for extending to other formal systems and program synthesis.

Abstract

Paper Structure (27 sections, 2 equations, 8 figures, 6 tables)

This paper contains 27 sections, 2 equations, 8 figures, 6 tables.

Introduction
Related Work
Lemma Extraction
Discovering Reusable Structures
Machine Learning for Theorem Proving
Metamath and Proof Representation
Method
Sub-component of a Proof Tree as a Theorem
Supervised Prediction Task
REFACTOR: Theorem-from-Proof Extractor
Experiments
Dataset and Pre-processing
Model Architecture and Training Protocol
Q1 - How many human-defined theorems does the model extract?
Q2 - Are newly extracted theorems by REFACTOR used frequently?
...and 12 more sections

Figures (8)

Figure 1: (a) and (b): proof tree visualization of theorems a1i and mp1i respectively. Each node of the proof tree contains two pieces of information: N and PROP. N refers to the name of the premise, axiom or theorem applied at this node and PROP is the resultant expression after applying N. Note that in (a), a1i has three hypotheses which are colored darkblue, darkgreen and lightgreen. In (b), proof of mp1i invokes theorem a1i and the three corresponding hypotheses to theorem application are highlighted with the same color. (c): proof tree of mp1i in (b) is expanded by substituting the proof tree of a1i in blue. These blue nodes, $\mathcal{V}_{target}$, are the targets for our proposed learning task.
Figure 2: Number of theorems vs number of occurrences in entire dataset (a) and test set (b). Both (a) and (b) show noticeable occurrence unbalance with (b) being less due to our further subsampling of a maximum 10 occurrence for test set. (c) Distribution of number of nodes in newly extracted theorems. The model mostly extracts short theorems but is also capable of extracting theorems that have hundreds of nodes.
Figure 3: A proof tree prediction where nodes with output probability greater than 0.5 have been colored blue. This proof tree does not satisfy the constraint to be a valid theorem because only one of the parent nodes of the root (imp31) are predicted to be in $\mathcal{V}_{target}$.
Figure 4: Visualization of theorem verification algorithm.
Figure 5: An example prediction that fails to be extracted as a new theorem due to no valid substitution plan in standardization. Specifically, the blue node wi cannot be substituted to a canonical argument allowed in Metamath while still keeping the proof tree valid.
...and 3 more figures

REFACTOR: Learning to Extract Theorems from Proofs

TL;DR

Abstract

REFACTOR: Learning to Extract Theorems from Proofs

Authors

TL;DR

Abstract

Table of Contents

Figures (8)