Table of Contents
Fetching ...

Understanding Formal Reasoning Failures in LLMs as Abstract Interpreters

Jacqueline L. Mitchell, Brian Hyeongseok Kim, Chenyu Zhou, Chao Wang

TL;DR

The paper probes whether large language models can reason about program semantics through the lens of abstract interpretation, focusing on invariant generation. It introduces two prompting strategies, Compositional and Transitional, to elicit step-by-step abstract-interpretation traces and evaluates them across four state-of-the-art LLMs on 22 SV-COMP benchmarks, verifying invariants with UAutomizer. The study reveals that LLMs can generate sound invariants in many cases but exhibit recurring reasoning errors in control-flow understanding, fixpoint computation, and operation semantics, with pronounced model- and program-dependent differences between strategies. The work highlights concrete thematic errors, analyzes their sources, and suggests opportunities for better prompting, modular context management, and possibly architectural adaptations to make LLMs more reliable for verification tasks. Overall, the findings illuminate both the potential and the current limits of using LLMs for formal verification tasks and guide future research toward more robust, reasoning-capable systems for invariant generation.

Abstract

Large language models (LLMs) are increasingly used for program verification, and yet little is known about \emph{how} they reason about program semantics during this process. In this work, we focus on abstract interpretation based-reasoning for invariant generation and introduce two novel prompting strategies that aim to elicit such reasoning from LLMs. We evaluate these strategies across several state-of-the-art LLMs on 22 programs from the SV-COMP benchmark suite widely used in software verification. We analyze both the soundness of the generated invariants and the key thematic patterns in the models' reasoning errors. This work aims to highlight new research opportunities at the intersection of LLMs and program verification for applying LLMs to verification tasks and advancing their reasoning capabilities in this application.

Understanding Formal Reasoning Failures in LLMs as Abstract Interpreters

TL;DR

The paper probes whether large language models can reason about program semantics through the lens of abstract interpretation, focusing on invariant generation. It introduces two prompting strategies, Compositional and Transitional, to elicit step-by-step abstract-interpretation traces and evaluates them across four state-of-the-art LLMs on 22 SV-COMP benchmarks, verifying invariants with UAutomizer. The study reveals that LLMs can generate sound invariants in many cases but exhibit recurring reasoning errors in control-flow understanding, fixpoint computation, and operation semantics, with pronounced model- and program-dependent differences between strategies. The work highlights concrete thematic errors, analyzes their sources, and suggests opportunities for better prompting, modular context management, and possibly architectural adaptations to make LLMs more reliable for verification tasks. Overall, the findings illuminate both the potential and the current limits of using LLMs for formal verification tasks and guide future research toward more robust, reasoning-capable systems for invariant generation.

Abstract

Large language models (LLMs) are increasingly used for program verification, and yet little is known about \emph{how} they reason about program semantics during this process. In this work, we focus on abstract interpretation based-reasoning for invariant generation and introduce two novel prompting strategies that aim to elicit such reasoning from LLMs. We evaluate these strategies across several state-of-the-art LLMs on 22 programs from the SV-COMP benchmark suite widely used in software verification. We analyze both the soundness of the generated invariants and the key thematic patterns in the models' reasoning errors. This work aims to highlight new research opportunities at the intersection of LLMs and program verification for applying LLMs to verification tasks and advancing their reasoning capabilities in this application.

Paper Structure

This paper contains 35 sections, 60 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A simple IMP-like context-free grammar for some program $P$.
  • Figure 2: Annotated program written in the grammar of Figure \ref{['fig:imp-grammar']}. {P0} ... {P9} mark program locations and [directives] mark control flow, which are included to help LLMs identify where to compute abstract states.
  • Figure 3: Two strategies for abstract interpretation: Compositional (left) and Transitional (right). The annotated program from Figure \ref{['fig:annotated_program']} is given as the in-context input, and the texts above show the in-context outputs corresponding to the two strategies. The flowcharts on the bottom visually represent the algorithmic flow of the two strategies.
  • Figure 4: Overall flow of the Compositional strategy for our running example. It corresponds to the high-level workflow shown on the bottom left of Figure \ref{['fig:prompts']}.
  • Figure 5: Fixed point equations for Transitional Prompting for our running example.

Theorems & Definitions (2)

  • definition 1: Program Invariant Map (IM)
  • definition 2: Interval Domain