Table of Contents
Fetching ...

Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

Thanh Le-Cong, Bach Le, Toby Murray

TL;DR

FormalBench provides a large-scale, language-integrated benchmark for evaluating LLMs on formal specification inference, addressing gaps in prior work by combining a diverse Java program corpus with rigorous consistency, completeness, and robustness metrics. Through eight models and four evaluation axes, the study reveals that current LLMs struggle with complex control flow and semantic-preserving transformations, though prompting strategies like Least-to-Most markedly improve performance and enable self-repair to recover a substantial fraction of failures. The work also demonstrates the feasibility and value of mutation-based completeness proxies and deductive verification for reliable evaluation, culminating in an open-source toolkit to spur future advances in formal program reasoning.

Abstract

Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.

Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

TL;DR

FormalBench provides a large-scale, language-integrated benchmark for evaluating LLMs on formal specification inference, addressing gaps in prior work by combining a diverse Java program corpus with rigorous consistency, completeness, and robustness metrics. Through eight models and four evaluation axes, the study reveals that current LLMs struggle with complex control flow and semantic-preserving transformations, though prompting strategies like Least-to-Most markedly improve performance and enable self-repair to recover a substantial fraction of failures. The work also demonstrates the feasibility and value of mutation-based completeness proxies and deductive verification for reliable evaluation, culminating in an open-source toolkit to spur future advances in formal program reasoning.

Abstract

Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.

Paper Structure

This paper contains 28 sections, 5 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Illustration of a Java program annotated with JML specifications (highlighted in green).
  • Figure 2: Distribution of our datasets and SpecGenBench over different control flow types.
  • Figure 3: Performance versus cost per task of proprietary large language models (LLMs).
  • Figure 4: Top-10 failure category of LLMs with various prompts
  • Figure 5: Effectiveness and self-repair rates of LLMs across iterations: "Iter $i$" represents self-repair with feedback, while "mutation" represents results from mutation-based repair in the final iteration.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2