Lemur: Integrating Large Language Models in Automated Program Verification
Haoze Wu, Clark Barrett, Nina Narodytska
TL;DR
Lemur introduces a formal, audio-verified framework that couples Large Language Models with automated verifiers to tackle automated program verification. It formalizes the interaction as a proof calculus with proform goals proposed and repaired by LLMs while checked by back-end verifiers, and proves soundness with a terminating instantiation. Empirically, lemur with GPT-4 substantially improves invariant generation and can solve hard SV-COMP benchmarks that challenge traditional verifiers, while yielding strong gains on Code2Inv benchmarks. The approach offers a scalable, modular way to leverage LLMs for high-level reasoning in program verification, compatible with existing tools and adaptable to different back-ends.
Abstract
The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of transition rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure and demonstrate practical improvements on a set of synthetic and competition benchmarks.
