Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

I. S. W. B. Prasetya; Fitsum Kifetew; Davide Prandi

Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

I. S. W. B. Prasetya, Fitsum Kifetew, Davide Prandi

Abstract

Formal specifications, such as pre- and post-conditions provide a solid basis for performing thorough program verification. However, developers rarely provide such formal specifications, hence if AI could help in constructing them, it would make formal verification possible or at least make automated testing much more effective. This paper presents a study on the ability of LLMs in generating formal FULL pre- and post-conditions of a program, given its natural language description. 24 state-of-the-art LLMs were evaluated on a freshly prepared dataset of 40 tasks. The paper investigates specifications of varying difficulty and discusses a set of more refined performance metrics in addition the general accept@1 performance. It also investigates the impact of using automatically generated tests for validation of the solutions proposed by LLMs. The results of the experiment reveal that, in general LLMs can produce valid pre- and post-conditions based on natural language descriptions of programs. Incorrect solutions from proprietary models are also often near correct. A closer inspection shows that open-source models tend to result in a higher proportion of erroneous results while proprietary models tend to have slightly higher false negative rates. Interestingly, the results also show that augmenting the manually prepared dataset with automatically generated tests leads to the exposure of wrong solutions, which would have otherwise been accepted as correct. In general, LLMs perform better in formalizing pre-conditions than on post-conditions and proprietary models perform better than open ones. However, none of the LLMs were able to correctly formalize all the tasks in our benchmark. Overall, the effectiveness and reliability of LLMs in formalizing pre- and post-conditions could be greatly improved by using a good test suite that checks the correctness of the LLM generated formalizations.

Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

Abstract

Paper Structure (24 sections, 11 equations, 9 figures, 2 tables)

This paper contains 24 sections, 11 equations, 9 figures, 2 tables.

Introduction
Related Work
Pre- and Post-conditions
Full and partial conditions
Interpretable specification
Specification difficulty
Benchmarking Methodology
Prompting
Solution validation
Experiment Design
Research questions
Dataset
Models
Experiment protocol
Metrics
...and 9 more sections

Figures (9)

Figure 1: The prompt to get a post-condition from an LLM. $P$ is to be replaced with the name of the program whose post-condition we want to extract. The red-part is to be filled with a textual, natural language, description of $P$'s functionality. Several examples of inputs-outputs of $P$ are also included in the prompt, as part of $P$'s description.
Figure 2: The basic workflow for generating more positive and negative tests using Pynguin. $F(x)$ is the program under consideration; $preF$ and $postF$ are solution pre- and post-conditions. Using $F$ we cannot actually generate negative post-condition tests. To do the latter, $F$ is replaced by a mutant ($F$ containing a mutated code) and re-run the same workflow.
Figure 3: Boxplot showing the distribution of Mean Passing Tests. Each dot represents the average across 10 replicas of successful tests for a task. The pink asterisk indicates the average accept@1 across all tasks.
Figure 4: Number of replicas where the LLM proposed solution on a task is not valid because at least one test returns an error when executed on it. Each dot represents a task. For instance, the dot with value ten for Claude3.7 model means that there is a task for which all ten replicas contain an execution of a test that results in an error.
Figure 5: Boxplot showing the False Negative (above) and Positive (below) Rates per task. Each dot shows the average FNR/FPR across ten replicas of a model on a task. Red cross mark indicates the Error Rate.
...and 4 more figures

Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

Abstract

Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

Authors

Abstract

Table of Contents

Figures (9)