Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Madeline Endres; Sarah Fakhoury; Saikat Chakraborty; Shuvendu K. Lahiri

TL;DR

This work investigates whether Large Language Models can translate informal natural language descriptions of code into formal, executable postconditions that reflect programmer intent. It defines metrics for correctness and completeness, and designs prompts to generate postconditions that are testable, language-agnostic, and discriminative against buggy implementations. Through experiments on EvalPlus (Python) and Defects4J (Java), the study shows that models like GPT-4 can produce high-quality postconditions with strong bug-discrimination power, catching real-world bugs and outperforming or complementing established baselines such as TOGA and Daikon. The findings suggest that NL-to-postcondition translation is a feasible and practically valuable direction for enhancing code trustworthiness and debugging in AI-assisted programming, with public artifacts to foster further research.

Abstract

Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program's intent. However, there is typically no guarantee that a program's implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language, which makes natural language intent challenging to check programmatically. The emergent abilities of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe nl2postcond, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcond postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcond via LLMs has the potential to be helpful in practice; nl2postcond-generated postconditions were able to catch 64 real-world historical bugs from Defects4J.
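
To make the idea of a postcondition "expressed as a program assertion" concrete, the following is a minimal Python sketch, not the paper's prompts, benchmarks, or exact metric definitions. It assumes a hypothetical text-wrapping function whose documentation says each output line must be at most width characters long, a hypothetical off-by-one mutant of it, and a hand-written postcondition of the kind an LLM might be asked to produce. All names (wrap_reference, wrap_buggy, postcondition, holds_on_all) are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

def wrap_reference(text: str, width: int) -> List[str]:
    """Hypothetical reference implementation: greedy word wrap.
    Assumes no single word is longer than `width`."""
    lines: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

def wrap_buggy(text: str, width: int) -> List[str]:
    """Hypothetical mutant: an off-by-one lets a line exceed `width` by one."""
    lines: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= width + 1:  # injected bug
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

def postcondition(text: str, width: int, return_value: List[str]) -> bool:
    """Formalizes the documented intent: every line has at most `width` characters."""
    return all(len(line) <= width for line in return_value)

def holds_on_all(impl: Callable[[str, int], List[str]],
                 post: Callable[[str, int, List[str]], bool],
                 inputs: Sequence[Tuple[str, int]]) -> bool:
    """True if the postcondition holds for `impl` on every test input."""
    return all(post(text, width, impl(text, width)) for text, width in inputs)

# Illustrative test inputs (assumed, not taken from any benchmark).
test_inputs = [("aaa bbb cc", 6), ("one two three", 7), ("", 5)]

# Correctness in the informal sense used here: never fails on the reference code.
is_correct = holds_on_all(wrap_reference, postcondition, test_inputs)
# Discriminative power in the informal sense used here: fails on the buggy mutant.
catches_bug = not holds_on_all(wrap_buggy, postcondition, test_inputs)
print(is_correct, catches_bug)
```

In this sketch a postcondition is useful when it is both satisfied by the reference implementation on all test inputs and violated by at least one buggy variant; a trivially true postcondition would pass the first check but fail the second.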

Paper Structure (44 sections, 2 equations, 7 figures, 4 tables)

Introduction
Motivating Examples
Formalizing User Intent
Detecting Real-World Functional Bugs
Overview
Contributions
nl2postcond: Overall Approach
Problem formulation and metrics
Test-set correctness
Test-set completeness for code mutants: bug-completeness-score
Prompt Design for LLM-based Postcondition Generation
RQ1: How well do LLM-generated postconditions formalize informal natural language intent?
RQ1 Experimental Setup
Evaluation Benchmark
Large Language Models
...and 29 more sections

Figures (7)

Figure 1: Example of how postconditions could be used to clarify ambiguous natural language specifications.
Figure 2: Example of how postconditions or other formal specifications of program behavior could catch bugs. This example is a historical bug from Defects4J (Math-9): the Line constructor does not return a new line with enough precision. The postconditions were generated by GPT-4 in our evaluation, and both catch the bug.
Figure 3: Prompt template for generating postconditions from natural language via chat models (including changes needed for the simple and no reference variations). We found that the bold text greatly improved the quality of the postconditions: without it, the model tended to return point-based tests or code blocks with side effects. While modified here slightly for clarity, our full prompts are included in our replication package.
Figure 4: Example of how the base and simple prompt variations can impact postcondition construction. Both postconditions were generated for HumanEval problem 12 using GPT-4.
Figure 5: Example from Defects4J (Cli project, bug 8) where the bug can be caught via nl2postcond. Cli 8 is a bug in the implementation for calculating the width of lines when wrapping output text. The natural language function description specifically says that each line must be at most width characters long. GPT-4 translates this intent into the provided postcondition, which correctly catches the bug.
...and 2 more figures
