Table of Contents
Fetching ...

Adaptive Proof Refinement with LLM-Guided Strategy Selection

Minghai Lu, Zhe Zhou, Danning Xie, Songlin Jia, Benjamin Delaware, Tianyi Zhang

TL;DR

Formal verification with theorem proving is powerful but hard to scale due to manual effort and expertise requirements. Adapt introduces an adaptive proof refinement framework that uses an LLM-guided decision-maker to dynamically select refinement strategies (lemma discovery, context enrichment, regeneration) based on the proof state and available context, enabling iterative refinement that responds to specific failures. Empirical evaluation on CoqStoq and a newly mined CoqDev benchmark shows Adapt substantially outperforms state-of-the-art baselines and generalizes across five LLMs, with ablations confirming the value of its component strategies and decision-maker design. The work also provides CoqDev, a realistic benchmark mined from real-world Coq commits, to better reflect development workflows and validate the framework’s practical impact in improving automated proof construction.

Abstract

Formal verification via theorem proving enables the expressive specification and rigorous proof of software correctness, but it is difficult to scale due to the significant manual effort and expertise required. While Large Language Models (LLMs) show potential in proof generation, they frequently produce incorrect proofs on the first attempt and require additional strategies for iterative refinement. However, existing approaches employ fixed refinement strategies and cannot dynamically choose an effective strategy based on the particular issues in a generated proof, which limits their performance. To overcome this limitation, we introduce Adapt, a novel proof refinement framework that leverages an LLM-guided decision-maker to dynamically select a suitable refinement strategy according to the state of the proof assistant and available context of an incorrect proof. We evaluate Adapt on two benchmarks against four existing methods and find that it significantly outperforms the best baseline on both by proving 16.63% and 18.58% more theorems, respectively. Furthermore, we demonstrate Adapt's generalizability by evaluating it across five different LLMs. We also conduct ablation studies to measure the contribution of each component and compare the trade-offs of alternative decision-maker designs.

Adaptive Proof Refinement with LLM-Guided Strategy Selection

TL;DR

Formal verification with theorem proving is powerful but hard to scale due to manual effort and expertise requirements. Adapt introduces an adaptive proof refinement framework that uses an LLM-guided decision-maker to dynamically select refinement strategies (lemma discovery, context enrichment, regeneration) based on the proof state and available context, enabling iterative refinement that responds to specific failures. Empirical evaluation on CoqStoq and a newly mined CoqDev benchmark shows Adapt substantially outperforms state-of-the-art baselines and generalizes across five LLMs, with ablations confirming the value of its component strategies and decision-maker design. The work also provides CoqDev, a realistic benchmark mined from real-world Coq commits, to better reflect development workflows and validate the framework’s practical impact in improving automated proof construction.

Abstract

Formal verification via theorem proving enables the expressive specification and rigorous proof of software correctness, but it is difficult to scale due to the significant manual effort and expertise required. While Large Language Models (LLMs) show potential in proof generation, they frequently produce incorrect proofs on the first attempt and require additional strategies for iterative refinement. However, existing approaches employ fixed refinement strategies and cannot dynamically choose an effective strategy based on the particular issues in a generated proof, which limits their performance. To overcome this limitation, we introduce Adapt, a novel proof refinement framework that leverages an LLM-guided decision-maker to dynamically select a suitable refinement strategy according to the state of the proof assistant and available context of an incorrect proof. We evaluate Adapt on two benchmarks against four existing methods and find that it significantly outperforms the best baseline on both by proving 16.63% and 18.58% more theorems, respectively. Furthermore, we demonstrate Adapt's generalizability by evaluating it across five different LLMs. We also conduct ablation studies to measure the contribution of each component and compare the trade-offs of alternative decision-maker designs.

Paper Structure

This paper contains 27 sections, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: Overview of Adapt.
  • Figure 2: An example of an established Coq context, including a definitions (nat), a boolean function (leb), and a inductive relation (le), along with several lemmas and a theorem.
  • Figure 3: Example of theorem proving in Coq.
  • Figure 4: Intermediate proof states.
  • Figure 5: Prompt template for initial proof generation.
  • ...and 6 more figures