Table of Contents
Fetching ...

LLM For Loop Invariant Generation and Fixing: How Far Are We?

Mostafijur Rahman Akhond, Saikat Chakraborty, Gias Uddin

TL;DR

This study empirically evaluates whether large language models can generate and repair loop invariants for SMT specifications. It systematically compares instruction-based prompts, few-shot prompts, problem partitioning, and the use of verifier feedback across several LLMs, including GPT-4o and Mistral-large. The results show that guided prompts can yield up to 78% success in generating correct invariants (with integrated instruction+few-shot prompts), while repair remains difficult, with success up to 16% when given verifier-derived error details. The findings highlight the potential of LLM-guided invariant synthesis, while also revealing limitations in integration and repair, suggesting directions for more robust prompting, combining domain knowledge with exemplars, and enhanced feedback mechanisms. These insights are relevant for researchers and practitioners aiming to leverage LLMs to assist formal verification tasks in scalable software engineering workflows.

Abstract

A loop invariant is a property of a loop that remains true before and after each execution of the loop. The identification of loop invariants is a critical step to support automated program safety assessment. Recent advancements in Large Language Models (LLMs) have demonstrated potential in diverse software engineering (SE) and formal verification tasks. However, we are not aware of the performance of LLMs to infer loop invariants. We report an empirical study of both open-source and closed-source LLMs of varying sizes to assess their proficiency in inferring inductive loop invariants for programs and in fixing incorrect invariants. Our findings reveal that while LLMs exhibit some utility in inferring and repairing loop invariants, their performance is substantially enhanced when supplemented with auxiliary information such as domain knowledge and illustrative examples. LLMs achieve a maximum success rate of 78\% in generating, but are limited to 16\% in repairing the invariant.

LLM For Loop Invariant Generation and Fixing: How Far Are We?

TL;DR

This study empirically evaluates whether large language models can generate and repair loop invariants for SMT specifications. It systematically compares instruction-based prompts, few-shot prompts, problem partitioning, and the use of verifier feedback across several LLMs, including GPT-4o and Mistral-large. The results show that guided prompts can yield up to 78% success in generating correct invariants (with integrated instruction+few-shot prompts), while repair remains difficult, with success up to 16% when given verifier-derived error details. The findings highlight the potential of LLM-guided invariant synthesis, while also revealing limitations in integration and repair, suggesting directions for more robust prompting, combining domain knowledge with exemplars, and enhanced feedback mechanisms. These insights are relevant for researchers and practitioners aiming to leverage LLMs to assist formal verification tasks in scalable software engineering workflows.

Abstract

A loop invariant is a property of a loop that remains true before and after each execution of the loop. The identification of loop invariants is a critical step to support automated program safety assessment. Recent advancements in Large Language Models (LLMs) have demonstrated potential in diverse software engineering (SE) and formal verification tasks. However, we are not aware of the performance of LLMs to infer loop invariants. We report an empirical study of both open-source and closed-source LLMs of varying sizes to assess their proficiency in inferring inductive loop invariants for programs and in fixing incorrect invariants. Our findings reveal that while LLMs exhibit some utility in inferring and repairing loop invariants, their performance is substantially enhanced when supplemented with auxiliary information such as domain knowledge and illustrative examples. LLMs achieve a maximum success rate of 78\% in generating, but are limited to 16\% in repairing the invariant.

Paper Structure

This paper contains 38 sections, 3 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: An Example C program and its translation to SMT formulae. Here, $x!$ and $y!$ denote the updated state of variable $x$ and $y$, respectively, after each iteration of the loop. The PreF denotes the preconditions, PostF denotes the postcondition, and the TransF denotes the summary of the loop. The InvF is the inductive loop invariant that can verify that the program follows these specifications.
  • Figure 2: General workflow of the experiment
  • Figure 3: Prompt template designed to generate invariants using instructions with a large language model.
  • Figure 4: Prompt template to utilize partial invariant synthesis
  • Figure 5: Prompt for combining partial invariants
  • ...and 7 more figures