Adaptive Proof Refinement with LLM-Guided Strategy Selection
Minghai Lu, Zhe Zhou, Danning Xie, Songlin Jia, Benjamin Delaware, Tianyi Zhang
TL;DR
Formal verification with theorem proving is powerful but hard to scale due to manual effort and expertise requirements. Adapt introduces an adaptive proof refinement framework that uses an LLM-guided decision-maker to dynamically select refinement strategies (lemma discovery, context enrichment, regeneration) based on the proof state and available context, enabling iterative refinement that responds to specific failures. Empirical evaluation on CoqStoq and a newly mined CoqDev benchmark shows Adapt substantially outperforms state-of-the-art baselines and generalizes across five LLMs, with ablations confirming the value of its component strategies and decision-maker design. The work also provides CoqDev, a realistic benchmark mined from real-world Coq commits, to better reflect development workflows and validate the framework’s practical impact in improving automated proof construction.
Abstract
Formal verification via theorem proving enables the expressive specification and rigorous proof of software correctness, but it is difficult to scale due to the significant manual effort and expertise required. While Large Language Models (LLMs) show potential in proof generation, they frequently produce incorrect proofs on the first attempt and require additional strategies for iterative refinement. However, existing approaches employ fixed refinement strategies and cannot dynamically choose an effective strategy based on the particular issues in a generated proof, which limits their performance. To overcome this limitation, we introduce Adapt, a novel proof refinement framework that leverages an LLM-guided decision-maker to dynamically select a suitable refinement strategy according to the state of the proof assistant and available context of an incorrect proof. We evaluate Adapt on two benchmarks against four existing methods and find that it significantly outperforms the best baseline on both by proving 16.63% and 18.58% more theorems, respectively. Furthermore, we demonstrate Adapt's generalizability by evaluating it across five different LLMs. We also conduct ablation studies to measure the contribution of each component and compare the trade-offs of alternative decision-maker designs.
