LLM For Loop Invariant Generation and Fixing: How Far Are We?
Mostafijur Rahman Akhond, Saikat Chakraborty, Gias Uddin
TL;DR
This study empirically evaluates whether large language models can generate and repair loop invariants for SMT specifications. It systematically compares instruction-based prompts, few-shot prompts, problem partitioning, and the use of verifier feedback across several LLMs, including GPT-4o and Mistral-large. The results show that guided prompts can yield up to 78% success in generating correct invariants (with integrated instruction+few-shot prompts), while repair remains difficult, with success up to 16% when given verifier-derived error details. The findings highlight the potential of LLM-guided invariant synthesis, while also revealing limitations in integration and repair, suggesting directions for more robust prompting, combining domain knowledge with exemplars, and enhanced feedback mechanisms. These insights are relevant for researchers and practitioners aiming to leverage LLMs to assist formal verification tasks in scalable software engineering workflows.
Abstract
A loop invariant is a property of a loop that remains true before and after each execution of the loop. The identification of loop invariants is a critical step to support automated program safety assessment. Recent advancements in Large Language Models (LLMs) have demonstrated potential in diverse software engineering (SE) and formal verification tasks. However, we are not aware of the performance of LLMs to infer loop invariants. We report an empirical study of both open-source and closed-source LLMs of varying sizes to assess their proficiency in inferring inductive loop invariants for programs and in fixing incorrect invariants. Our findings reveal that while LLMs exhibit some utility in inferring and repairing loop invariants, their performance is substantially enhanced when supplemented with auxiliary information such as domain knowledge and illustrative examples. LLMs achieve a maximum success rate of 78\% in generating, but are limited to 16\% in repairing the invariant.
