Table of Contents
Fetching ...

Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu, Clark Barrett, Nina Narodytska

TL;DR

Lemur introduces a formal, audio-verified framework that couples Large Language Models with automated verifiers to tackle automated program verification. It formalizes the interaction as a proof calculus with proform goals proposed and repaired by LLMs while checked by back-end verifiers, and proves soundness with a terminating instantiation. Empirically, lemur with GPT-4 substantially improves invariant generation and can solve hard SV-COMP benchmarks that challenge traditional verifiers, while yielding strong gains on Code2Inv benchmarks. The approach offers a scalable, modular way to leverage LLMs for high-level reasoning in program verification, compatible with existing tools and adaptable to different back-ends.

Abstract

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.

Lemur: Integrating Large Language Models in Automated Program Verification

TL;DR

Lemur introduces a formal, audio-verified framework that couples Large Language Models with automated verifiers to tackle automated program verification. It formalizes the interaction as a proof calculus with proform goals proposed and repaired by LLMs while checked by back-end verifiers, and proves soundness with a terminating instantiation. Empirically, lemur with GPT-4 substantially improves invariant generation and can solve hard SV-COMP benchmarks that challenge traditional verifiers, while yielding strong gains on Code2Inv benchmarks. The approach offers a scalable, modular way to leverage LLMs for high-level reasoning in program verification, compatible with existing tools and adaptable to different back-ends.

Abstract

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.
Paper Structure (22 sections, 10 theorems, 1 equation, 4 figures, 3 tables, 1 algorithm)

This paper contains 22 sections, 10 theorems, 1 equation, 4 figures, 3 tables, 1 algorithm.

Key Result

Proposition 2.1

Let $\mathcal{P}\xspace$ be a program, and $p$, $q$ be properties on $\mathcal{P}$:

Figures (4)

  • Figure 1: Deductive rules of the lemur calculus.
  • Figure 2: A running example of executing the lemur calculus.
  • Figure 3: Nb. of proposals for lemur(GPT4) to solve a Code2Inv benchmark.
  • Figure 4: Nb. of proposals for lemur(GPT4) to solve an SV-COMP benchmark.

Theorems & Definitions (27)

  • Definition 2.1
  • Example 2.1
  • Definition 2.2
  • Definition 2.3
  • Example 2.2
  • Proposition 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Example 2.3
  • Definition 2.4
  • ...and 17 more