Table of Contents
Fetching ...

Towards AI-Assisted Synthesis of Verified Dafny Methods

Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, James Noble

TL;DR

This paper investigates whether contemporary Large Language Models can meaningfully synthesize verifiable Dafny methods by pairing advanced prompting techniques with formal verification constraints. It introduces MBPP-san-DFY and MBPP-DFY-50 to evaluate end-to-end synthesis of specifications and implementations, and demonstrates that Chain-of-Thought prompts augmented with Retrieval-Augmented Generation enable substantially higher verified-method rates, especially for GPT-4 (up to about $64\%$ verify@5 on 178 problems) compared to baseline prompts. Key findings include the critical role of strong postconditions, the benefits of dynamic few-shot prompting for generating complete specifications and invariants, and the observation that more contextual prompts (Signature) may hinder performance for verification tasks. The work provides a foundation for using LLMs as a component in a Programmer’s Verification Apprentice, with practical implications for accelerating formally verified code in Dafny and guiding future research on prompt design and verification-enabled AI programming.

Abstract

Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases, and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes retrieval augmentation generated example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks and that both models perform best with the retrieval augmentation generated CoT prompt. GPT-4 was able to generate verified, human-evaluated, Dafny methods for 58% of the problems, however, GPT-4 managed only 19% of the problems with the Contextless prompt, and even fewer (10%) for the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems, 50 that we wrote manually, and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code generating large language models...

Towards AI-Assisted Synthesis of Verified Dafny Methods

TL;DR

This paper investigates whether contemporary Large Language Models can meaningfully synthesize verifiable Dafny methods by pairing advanced prompting techniques with formal verification constraints. It introduces MBPP-san-DFY and MBPP-DFY-50 to evaluate end-to-end synthesis of specifications and implementations, and demonstrates that Chain-of-Thought prompts augmented with Retrieval-Augmented Generation enable substantially higher verified-method rates, especially for GPT-4 (up to about verify@5 on 178 problems) compared to baseline prompts. Key findings include the critical role of strong postconditions, the benefits of dynamic few-shot prompting for generating complete specifications and invariants, and the observation that more contextual prompts (Signature) may hinder performance for verification tasks. The work provides a foundation for using LLMs as a component in a Programmer’s Verification Apprentice, with practical implications for accelerating formally verified code in Dafny and guiding future research on prompt design and verification-enabled AI programming.

Abstract

Large language models show great promise in many domains, including programming. A promise is easy to make but hard to keep, and language models often fail to keep their promises, generating erroneous code. A promising avenue to keep models honest is to incorporate formal verification: generating programs' specifications as well as code so that the code can be proved correct with respect to the specifications. Unfortunately, existing large language models show a severe lack of proficiency in verified programming. In this paper, we demonstrate how to improve two pretrained models' proficiency in the Dafny verification-aware language. Using 178 problems from the MBPP dataset, we prompt two contemporary models (GPT-4 and PaLM-2) to synthesize Dafny methods. We use three different types of prompts: a direct Contextless prompt; a Signature prompt that includes a method signature and test cases, and a Chain of Thought (CoT) prompt that decomposes the problem into steps and includes retrieval augmentation generated example problems and solutions. Our results show that GPT-4 performs better than PaLM-2 on these tasks and that both models perform best with the retrieval augmentation generated CoT prompt. GPT-4 was able to generate verified, human-evaluated, Dafny methods for 58% of the problems, however, GPT-4 managed only 19% of the problems with the Contextless prompt, and even fewer (10%) for the Signature prompt. We are thus able to contribute 153 verified Dafny solutions to MBPP problems, 50 that we wrote manually, and 103 synthesized by GPT-4. Our results demonstrate that the benefits of formal program verification are now within reach of code generating large language models...
Paper Structure (39 sections, 5 figures, 5 tables)

This paper contains 39 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An example Dafny method; FindSmallest
  • Figure 2: Examples of Contextless and Signature prompts
  • Figure 3: A high-level architecture of the RAG prompt-chaining pipeline for Dynamic Few-Shot Prompting.
  • Figure 4: Examples of Specification Prompt (top) and CoT Prompt (bottom)
  • Figure 5: Temperature tuning for GPT-4 and PaLM-2 on a sample of 50 problems at four different temperatures $T\in\{0.0,0.25,0.5,0.75\}$ for three different prompts. The vertical axis shows the verified Dafny method synthesis rate in verify@k (k=5).