Proving Theorems Recursively

Haiming Wang; Huajian Xin; Zhengying Liu; Wenda Li; Yinya Huang; Jianqiao Lu; Zhicheng Yang; Jing Tang; Jian Yin; Zhenguo Li; Xiaodan Liang

TL;DR

This paper introduces POETRY, a recursive proving framework that builds verifiable proof sketches level by level in Isabelle, deferring detailed subproofs with a sorry placeholder. By combining recursive data construction with a recursive best-first search (BFS), it avoids myopic step-by-step search and enables longer, more complex proofs. On miniF2F, POETRY improves the average proving success rate by 5.1% over prior methods, and the maximum length of the proofs it finds rises from 10 to 26. The authors present the method as extensible to other formal environments.

Abstract

Recent advances in automated theorem proving leverage language models to explore expanded search spaces through step-by-step proof generation. However, such approaches are usually based on short-sighted heuristics (e.g., log probability or value function scores) that can lead to suboptimal or even distracting subgoals, preventing them from finding longer proofs. To address this challenge, we propose POETRY (PrOvE Theorems RecursivelY), which proves theorems in a recursive, level-by-level manner in the Isabelle theorem prover. Unlike previous step-by-step methods, POETRY searches for a verifiable sketch of the proof at each level and focuses on solving the current level's theorem or conjecture. Detailed proofs of intermediate conjectures within the sketch are temporarily replaced by a placeholder tactic called sorry, deferring their proofs to subsequent levels. This approach allows the theorem to be tackled incrementally by outlining the overall theorem at the first level and then solving the intermediate conjectures at deeper levels. Experiments on the miniF2F and PISA datasets show significant performance gains for POETRY over state-of-the-art methods. On miniF2F, POETRY achieves an average proving success rate improvement of 5.1%. Moreover, we observe a substantial increase in the maximum proof length found by POETRY, from 10 to 26.
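
The level-by-level decomposition described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the `(level, text)` step encoding, the function name `extract_sketches`, and the Isabelle-like step strings are all assumptions made for illustration. Each run of steps nested deeper than the current level is replaced by a single `sorry` and becomes its own sketch at the next level.

```python
from typing import List, Tuple

# Hypothetical encoding: each proof step is (nesting level, step text).
Step = Tuple[int, str]


def extract_sketches(steps: List[Step], depth: int = 0) -> List[List[str]]:
    """Split a nested proof into per-level sketches.

    Steps at `depth` are kept as-is. Each contiguous run of deeper steps
    is replaced by one "sorry" and is split recursively into its own
    sketch(es), which are appended after the current sketch.
    """
    sketch: List[str] = []
    deeper: List[List[str]] = []
    i = 0
    while i < len(steps):
        level, text = steps[i]
        if level <= depth:
            sketch.append(text)
            i += 1
            continue
        # Collect the whole run of steps nested below the current level.
        j = i
        while j < len(steps) and steps[j][0] > depth:
            j += 1
        sketch.append("sorry")  # defer this subproof to the next level
        deeper.extend(extract_sketches(steps[i:j], depth + 1))
        i = j
    return [sketch] + deeper


# Illustrative proof only; the step texts are placeholders, not a checked proof.
proof = [
    (0, 'have a: "A"'),
    (1, "proof -"),
    (1, 'have b: "B"'),
    (2, "by simp"),
    (1, "show ?thesis using b by simp"),
    (1, "qed"),
    (0, "show ?thesis using a by simp"),
]
sketches = extract_sketches(proof)
for level_sketch in sketches:
    print(level_sketch)
```

In this toy input the seven original steps are split into three sketches, and the first-level sketch has only three steps, which mirrors the idea that each sketch is shorter than the full proof.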

Paper Structure (22 sections, 1 equation, 5 figures, 3 tables, 1 algorithm)

Introduction
Preliminary
Formal Mathematical Environments
Search-Based Neural Theorem Proving
Methodology
Recursive Data Construction
Recursive Best-First Search
Experiments
Experimental Setup
Main Results
Analysis
Related Works
Limitations
Conclusion
More Details on POETRY
...and 7 more sections

Figures (5)

Figure 1: Comparison between the step-by-step proof and the recursive proof. (a) A step-by-step proving approach ignores the hierarchical structure inherent in the proof, treating it merely as a sequence of proof steps. The proof cannot be verified as valid until it is fully complete. (b) The recursive proving method decomposes the structured proof into different levels of verifiable proof sketches. Each proof sketch attempts to prove the target theorem or conjecture by outlining the primary steps at the current level and postponing the proof of intermediate conjectures to the next level.
Figure 2: A walkthrough example of recursive BFS. Each node in the proof tree is a proof state and each edge is a proof step. (a) The proof search begins by finding the proof sketch at the first level using BFS. The search pauses once it identifies a successful proof path, marked with P and HP nodes. This proof path contains a sorry edge, indicating that it includes skipped conjectures or subgoals that must be addressed at the next level. (b) Recursive BFS enters the next level of proof search to attempt to prove the subgoal skipped at the first level. The proof search for this subgoal fails because no valid nodes remain to explore, and the search returns to the first level. (c) After the failed attempt to prove the subgoal, the previously established proof path at the first level becomes invalid. Consequently, the failure is backpropagated from the second level's root node up to the first-level root node, updating all the HP nodes to O nodes. (d) At the first level, with the status set back to open, the search continues to explore new proof paths and discovers another one. This path also contains a sorry edge with a skipped conjecture that needs to be proved at the next level. (e) As in (b), the recursive BFS proceeds to the next level to search for a proof of the skipped conjecture. It finds a proof path without any sorry edges (denoted as P nodes), indicating that the conjecture has been proven without any skipped intermediate conjectures or subgoals. (f) After finding the sub-level proof, the recursive BFS returns to the first level and backpropagates the PROVED message to the root, completing the proof.
Figure 3: Proof length comparison between POETRY and the GPT-f Baseline. The y-axis is shown in log scale. (a) Histogram of the lengths of proofs found in the miniF2F dataset. Most of the proofs found are within 3 steps long, especially for the GPT-f Baseline, but POETRY manages to find longer proofs, with up to 18 proof steps in a single proof. (b) Histogram of the lengths of proofs found in the PISA dataset. POETRY discovers many more proofs with longer proof lengths.
Figure 4: Case comparison between POETRY and the GPT-f Baseline. (a) Recursive proof found by POETRY in 71.2 seconds; the proof contains two proof levels. (b) Failed proof paths found by the GPT-f Baseline, which fails to find a proof and times out after 600 seconds. Two different failed proof paths found by the GPT-f Baseline are shown.
Figure 5: Distribution of proof level and proof length in the PISA dataset. (a) Histogram of proof level in the PISA training set. The maximum proof level can reach 26. (b) Comparison between the number of steps in the original proof and in the extracted proof sketches. By breaking the original proof into proof sketches, the proof length is reduced substantially.
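
The control flow walked through in Figure 2 can be approximated with a short Python sketch. It is a deliberately simplified toy, not the paper's algorithm: the real search expands proof states one tactic at a time, whereas here each goal comes with a hypothetical, precomputed list of candidate sketches given as `(score, name, deferred_subgoals)`. The names `prove`, `candidates`, and all goal and sketch labels are assumptions. What it does show is the recursion: pick the best-scoring sketch, descend into each subgoal deferred with sorry, and fall back to the next sketch when a sub-level search fails.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Hypothetical toy prover: for each goal, candidate sketches given as
# (score, sketch name, subgoals deferred with sorry). Higher score is better.
Candidates = Dict[str, List[Tuple[float, str, List[str]]]]


def prove(goal: str, candidates: Candidates,
          depth: int = 0, max_depth: int = 5) -> Optional[dict]:
    """Return a nested proof for `goal`, or None if every sketch fails."""
    if depth > max_depth:
        return None
    # heapq is a min-heap, so negate scores to pop the best sketch first.
    heap = [(-score, name, subs) for score, name, subs in candidates.get(goal, [])]
    heapq.heapify(heap)
    while heap:
        _, name, subgoals = heapq.heappop(heap)
        sub_proofs = {}
        for sub in subgoals:
            proof = prove(sub, candidates, depth + 1, max_depth)
            if proof is None:
                break  # sub-level failed: this sketch is invalid, try the next
            sub_proofs[sub] = proof
        else:
            # Every deferred subgoal was proved (or there were none).
            return {"sketch": name, "subproofs": sub_proofs}
    return None  # no valid sketches left to explore


# Toy instance mirroring the figure: the best-scoring sketch defers a
# subgoal that cannot be proved, so the search falls back to the second one.
candidates: Candidates = {
    "theorem": [(-0.2, "sketch_A", ["lemma_hard"]),
                (-0.9, "sketch_B", ["lemma_easy"])],
    "lemma_hard": [],
    "lemma_easy": [(-0.1, "by_simp", [])],
}
result = prove("theorem", candidates)
print(result)
```

In this toy run, `sketch_A` is tried first, its deferred subgoal has no candidates, and the search returns to the first level and succeeds with `sketch_B`, analogous to panels (b) through (f).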
