Ranking LLM-Generated Loop Invariants for Program Verification
Saikat Chakraborty, Shuvendu K. Lahiri, Sarah Fakhoury, Madanlal Musuvathi, Akash Lal, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, Nikhil Swamy
TL;DR
This paper addresses automating loop invariant inference for program verification by leveraging LLMs to generate candidate invariants and re-ranking them to minimize verifier calls. It introduces iRank, a contrastive learning-based ranker that maps problem definitions and invariants into a shared embedding space and maximizes similarity for correct invariants while distancing incorrect ones. Using a dataset of problems with verified and incorrect invariants and two embedding backends, experiments show the median verified invariant moves from 31 to as low as 4 and V@K improves, substantially reducing Z3 invocations. The method is orthogonal to the invariant generator and can improve efficiency for interactive verification workflows and other generators.
Abstract
Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available in \url{https://github.com/microsoft/NeuralInvariantRanker}.
