Ranking LLM-Generated Loop Invariants for Program Verification

Saikat Chakraborty; Shuvendu K. Lahiri; Sarah Fakhoury; Madanlal Musuvathi; Akash Lal; Aseem Rastogi; Aditya Senthilnathan; Rahul Sharma; Nikhil Swamy

Ranking LLM-Generated Loop Invariants for Program Verification

Saikat Chakraborty, Shuvendu K. Lahiri, Sarah Fakhoury, Madanlal Musuvathi, Akash Lal, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, Nikhil Swamy

TL;DR

This paper addresses automating loop invariant inference for program verification by leveraging LLMs to generate candidate invariants and re-ranking them to minimize verifier calls. It introduces iRank, a contrastive learning-based ranker that maps problem definitions and invariants into a shared embedding space and maximizes similarity for correct invariants while distancing incorrect ones. Using a dataset of problems with verified and incorrect invariants and two embedding backends, experiments show the median verified invariant moves from 31 to as low as 4 and V@K improves, substantially reducing Z3 invocations. The method is orthogonal to the invariant generator and can improve efficiency for interactive verification workflows and other generators.

Abstract

Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available in \url{https://github.com/microsoft/NeuralInvariantRanker}.

Ranking LLM-Generated Loop Invariants for Program Verification

TL;DR

Abstract

Paper Structure (21 sections, 5 equations, 8 figures, 5 tables)

This paper contains 21 sections, 5 equations, 8 figures, 5 tables.

Introduction
Related Work
Background & Motivation
Background: Loop Invariant Inference
Motivation and Problem Formulations
iRank : Methodology
Experimental Design and Results
Experimental Setup
Results
Conclusion
Assumptions of LLM inference cost.
Comparison with state-of-the-art (SOTA) invariant synthesis.
Stability of LLM predictions.
Further Details of iRank
Training iRank
...and 6 more sections

Figures (8)

Figure 1: LLM for Loop Invariant synthesis.
Figure 2: Detailed results comparing ranks of the correct invariants and number of z3 calls.
Figure 3: Percentage of problems solved w.r.t. number of invariants from the (re-)ranked list.
Figure 4: Invariant synthesis problem in FiB-8.sl
Figure 5: Correct invariant generated by gpt-3.5-turbo.
...and 3 more figures

Ranking LLM-Generated Loop Invariants for Program Verification

TL;DR

Abstract

Ranking LLM-Generated Loop Invariants for Program Verification

Authors

TL;DR

Abstract

Table of Contents

Figures (8)