Table of Contents
Fetching ...

Towards General Loop Invariant Generation: A Benchmark of Programs with Memory Manipulation

Chang Liu, Xiwei Wu, Yuan Feng, Qinxiang Cao, Junchi Yan

TL;DR

The paper tackles the challenge of general loop invariant generation for programs with memory manipulation by introducing the LIG-MM benchmark and showing that existing methods struggle on such programs. It introduces LLM-SE, a framework that couples large language models with symbolic execution and self-supervised fine-tuning to generate valid loop invariants, validated on LIG-MM. The approach leverages predicate recovery and an offline-online training paradigm to produce invariants that are subsequently checked by entailment solvers and a SAT-based picker for conciseness. Empirical results indicate that LLM-SE outperforms state-of-the-art baselines on LIG-MM, signaling a new direction for automated program verification in real-world contexts. The work highlights the potential of integrating LLM-based reasoning with formal methods to handle complex data structures and memory operations across multi-loop programs.

Abstract

Program verification is vital for ensuring software reliability, especially in the context of increasingly complex systems. Loop invariants, remaining true before and after each iteration of loops, are crucial for this verification process. Traditional provers and machine learning based methods for generating loop invariants often require expert intervention or extensive labeled data, and typically only handle numerical property verification. These methods struggle with programs involving complex data structures and memory manipulations, limiting their applicability and automation capabilities. In this paper, we introduce a new benchmark named LIG-MM, specifically for programs with complex data structures and memory manipulations. We collect 312 programs from various sources, including daily programs from college homework, the international competition (SV-COMP), benchmarks from previous papers (SLING), and programs from real-world software systems (Linux Kernel, GlibC, LiteOS, and Zephyr). Based on LIG-MM, our findings indicate that previous methods, including GPT-4, fail to automate verification for these programs. Consequently, we propose a novel LLM-SE framework that coordinates LLM with symbolic execution, fine-tuned using self-supervised learning, to generate loop invariants. Experimental results on LIG-MM demonstrate that our LLM-SE outperforms state-of-the-art methods, offering a new direction toward automated program verification in real-world scenarios.

Towards General Loop Invariant Generation: A Benchmark of Programs with Memory Manipulation

TL;DR

The paper tackles the challenge of general loop invariant generation for programs with memory manipulation by introducing the LIG-MM benchmark and showing that existing methods struggle on such programs. It introduces LLM-SE, a framework that couples large language models with symbolic execution and self-supervised fine-tuning to generate valid loop invariants, validated on LIG-MM. The approach leverages predicate recovery and an offline-online training paradigm to produce invariants that are subsequently checked by entailment solvers and a SAT-based picker for conciseness. Empirical results indicate that LLM-SE outperforms state-of-the-art baselines on LIG-MM, signaling a new direction for automated program verification in real-world contexts. The work highlights the potential of integrating LLM-based reasoning with formal methods to handle complex data structures and memory operations across multi-loop programs.

Abstract

Program verification is vital for ensuring software reliability, especially in the context of increasingly complex systems. Loop invariants, remaining true before and after each iteration of loops, are crucial for this verification process. Traditional provers and machine learning based methods for generating loop invariants often require expert intervention or extensive labeled data, and typically only handle numerical property verification. These methods struggle with programs involving complex data structures and memory manipulations, limiting their applicability and automation capabilities. In this paper, we introduce a new benchmark named LIG-MM, specifically for programs with complex data structures and memory manipulations. We collect 312 programs from various sources, including daily programs from college homework, the international competition (SV-COMP), benchmarks from previous papers (SLING), and programs from real-world software systems (Linux Kernel, GlibC, LiteOS, and Zephyr). Based on LIG-MM, our findings indicate that previous methods, including GPT-4, fail to automate verification for these programs. Consequently, we propose a novel LLM-SE framework that coordinates LLM with symbolic execution, fine-tuned using self-supervised learning, to generate loop invariants. Experimental results on LIG-MM demonstrate that our LLM-SE outperforms state-of-the-art methods, offering a new direction toward automated program verification in real-world scenarios.
Paper Structure (25 sections, 3 equations, 5 figures, 2 tables, 3 algorithms)

This paper contains 25 sections, 3 equations, 5 figures, 2 tables, 3 algorithms.

Figures (5)

  • Figure 1: A numerical program with a correctness assertion and a loop invariant to prove it.
  • Figure 2: (Left) An example of programs with the data structure like the single linked list; (Right) The pipeline of our proposed LLM-SE: we start with the precondition (S0) and conduct the loop body with symbolic execution multiple times to get more separation logic assertions (S1, S2, S3, ...). Based on these separation logic assertions, we further use LLM to infer the loop invariant.
  • Figure 3: Our proposed framework LLM-SE. (Left) Offline training: we construct an auxiliary task by splitting and recovering the predicates of data structures, following the self-supervised learning paradigm to finetune LLM; (Right) Online querying: we design an interactive system to handle the programs needed to be verified. The well-trained LLM is directly applied to unseen programs. Multiple verification tools are utilized to cooperate with the LLM to generate valid loop invariants.
  • Figure 4: The one attempt pass rate (Pass Rate @ 1) on every source of programs in our LIG-MM, where the x-axis denotes the different sources and four bars represent the pass rates of four methods.
  • Figure 5: Our self-supervised learning paradigm for fine-tuning LLM. To solve the data scarcity issue, we design an auxiliary task called predicate recovery. We split the data structure completely based on its definitions, further process them from unfolded expressions to synthetic separation logic assertions, and let the LLM try to recover the original separating conjunct. Moreover, we apply multiple data augmentation and mix up strategies to refine the synthetic assertions to mimic the real ones, making the auxiliary task more challenging.