Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu; Clark Barrett; Nina Narodytska

Lemur: Integrating Large Language Models in Automated Program Verification

Haoze Wu, Clark Barrett, Nina Narodytska

TL;DR

Lemur introduces a formal, audio-verified framework that couples Large Language Models with automated verifiers to tackle automated program verification. It formalizes the interaction as a proof calculus with proform goals proposed and repaired by LLMs while checked by back-end verifiers, and proves soundness with a terminating instantiation. Empirically, lemur with GPT-4 substantially improves invariant generation and can solve hard SV-COMP benchmarks that challenge traditional verifiers, while yielding strong gains on Code2Inv benchmarks. The approach offers a scalable, modular way to leverage LLMs for high-level reasoning in program verification, compatible with existing tools and adaptable to different back-ends.

Abstract

The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.

Lemur: Integrating Large Language Models in Automated Program Verification

TL;DR

Abstract

Paper Structure (22 sections, 10 theorems, 1 equation, 4 figures, 3 tables, 1 algorithm)

This paper contains 22 sections, 10 theorems, 1 equation, 4 figures, 3 tables, 1 algorithm.

Introduction
Definitions
lemur: Integrating LLMs in Program Verification
Instantiating the lemur calculus
Experiments
Building an LLM-based program verifier
Loop invariant generation benchmarks
Solving hard SV-COMP benchmarks
Discussion of limitations and extensions
Background
Automated reasoning tools.
Automated reasoning tools.
Large Language Models.
Definitions
lemur: Integrating LLMs in Program Verification
...and 7 more sections

Key Result

Proposition 2.1

Let $\mathcal{P}\xspace$ be a program, and $p$, $q$ be properties on $\mathcal{P}$:

Figures (4)

Figure 1: Deductive rules of the lemur calculus.
Figure 2: A running example of executing the lemur calculus.
Figure 3: Nb. of proposals for lemur(GPT4) to solve a Code2Inv benchmark.
Figure 4: Nb. of proposals for lemur(GPT4) to solve an SV-COMP benchmark.

Theorems & Definitions (27)

Definition 2.1
Example 2.1
Definition 2.2
Definition 2.3
Example 2.2
Proposition 2.1
Proposition 2.2
Proposition 2.3
Example 2.3
Definition 2.4
...and 17 more

Lemur: Integrating Large Language Models in Automated Program Verification

TL;DR

Abstract

Lemur: Integrating Large Language Models in Automated Program Verification

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (4)

Theorems & Definitions (27)