Don't Eliminate Cut: Exponential Separations in LLM-Based Theorem Proving
Sho Sonoda, Shunta Akiyama, Yuya Uezato
TL;DR
The paper develops a theoretical framework for LLM-guided theorem proving as a finite-horizon deterministic MDP over compact metric spaces, with Lipschitz policy classes. It introduces two data-generation regimes: a cut-free baseline and a latent proof-DAG model that captures reusable cuts/lemmas, enabling analysis of hierarchical search via EM/VAE-style learning. Under Tsybakov-type margin assumptions and sequential generalization bounds, the authors derive lower bounds on finite-horizon success probability that separate search and imitation errors, and show an exponential separation: when cut-elimination inflates a DAG into a large cut-free tree, a cut-aware hierarchical learner requires exponentially less data to achieve a target success level. This provides a principled explanation for the effectiveness of subgoal decomposition in modern agentic theorem provers and highlights how structured latent representations can dramatically improve sample efficiency in rigorous automated reasoning.
Abstract
We develop a theoretical analysis of LLM-guided formal theorem proving in interactive proof assistants (e.g., Lean) by modeling tactic proposal as a stochastic policy in a finite-horizon deterministic MDP. To capture modern representation learning, we treat the state and action spaces as general compact metric spaces and assume Lipschitz policies. To explain the gap between worst-case hardness and empirical success, we introduce problem distributions generated by a reference policy $q$, including a latent-variable model in which proofs exhibit reusable cut/lemma/sketch structure represented by a proof DAG. Under a top-$k$ search protocol and Tsybakov-type margin conditions, we derive lower bounds on finite-horizon success probability that decompose into search and learning terms, with learning controlled by sequential Rademacher/covering complexity. Our main separation result shows that when cut elimination expands a DAG of depth $D$ into a cut-free tree of size $Ω(Λ^D)$ while the cut-aware hierarchical process has size $O(λ^D)$ with $λ\llΛ$, a flat (cut-free) learner provably requires exponentially more data than a cut-aware hierarchical learner. This provides a principled justification for subgoal decomposition in recent agentic theorem provers.
