Table of Contents
Fetching ...

dafny-annotator: AI-Assisted Verification of Dafny Programs

Gabriel Poesia, Chloe Loughridge, Nada Amin

TL;DR

Using a combination of Large Language Models and search to build dafny-annotator: a tool that adds logical annotations to a Dafny method until the verifier can prove it correct, which suggests a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples.

Abstract

Formal verification has the potential to drastically reduce software bugs, but its high additional cost has hindered large-scale adoption. While Dafny presents a promise to significantly reduce the effort to write verified programs, users are often required to provide logical annotations to aid the verifier. Here, we explore using a combination of Large Language Models and search to build dafny-annotator: a tool that adds logical annotations to a Dafny method until the verifier can prove it correct. On a test set from the DafnyBench collection of programs, greedy search guided by LLaMa 3.1 8B successfully annotates only 15.7% of the methods. Since this data-driven approach is hindered by the lack of large-scale training data, we propose a method for open-ended synthesis of new Dafny programs in a flexible pipeline where LLMs formulate high-level ideas, implement them, and incrementally propose changes to existing programs, which Dafny validates. This gives us a synthetic dataset, DafnySynth, which we use to augment DafnyBench for training. Fine-tuning on both datasets boosts LLaMa 8B's success rate to 50.6% -- significantly better than the base model, or training on either dataset alone. Our results suggest a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples. In turn, such assistants might reduce friction for users and ultimately drive adoption.

dafny-annotator: AI-Assisted Verification of Dafny Programs

TL;DR

Using a combination of Large Language Models and search to build dafny-annotator: a tool that adds logical annotations to a Dafny method until the verifier can prove it correct, which suggests a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples.

Abstract

Formal verification has the potential to drastically reduce software bugs, but its high additional cost has hindered large-scale adoption. While Dafny presents a promise to significantly reduce the effort to write verified programs, users are often required to provide logical annotations to aid the verifier. Here, we explore using a combination of Large Language Models and search to build dafny-annotator: a tool that adds logical annotations to a Dafny method until the verifier can prove it correct. On a test set from the DafnyBench collection of programs, greedy search guided by LLaMa 3.1 8B successfully annotates only 15.7% of the methods. Since this data-driven approach is hindered by the lack of large-scale training data, we propose a method for open-ended synthesis of new Dafny programs in a flexible pipeline where LLMs formulate high-level ideas, implement them, and incrementally propose changes to existing programs, which Dafny validates. This gives us a synthetic dataset, DafnySynth, which we use to augment DafnyBench for training. Fine-tuning on both datasets boosts LLaMa 8B's success rate to 50.6% -- significantly better than the base model, or training on either dataset alone. Our results suggest a path towards capable AI assistants for languages that don't yet have large-scale human-generated examples. In turn, such assistants might reduce friction for users and ultimately drive adoption.

Paper Structure

This paper contains 11 sections, 2 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Example of an annotation prediction prompt using an LLM. This format is used by dafny-annotator to sample candidate annotations from an LLM, which it then tries to insert in the program. For training dafny-annotator, we extract examples in this format from existing annotated programs, where we remove one annotation and train the LLM to predict it (highlighted portion of the example).
  • Figure 2: Example taken from DafnySynth, of a sequence of edits that produces a fully verified method for training dafny-annotator. The root node at the top has no content, and is used to initialize the edit graph. The "Idea Proposer" creates nodes from the root node by asking an LLM to write a high-level idea of a verified program one might implement. Then, the "Idea Implementer" takes idea nodes as inputs, and again asks an LLM to write a small Dafny program related to that description. Here, it produces all lines except for the highlighted annotations (10 and 19). It checks that the resulting program is valid and that all of its annotations are accepted by Dafny, though initially the program does not fully verify. Then, an "Annotator" editor can take program nodes that don't fully verify, and attempt to add annotations to them. Here, GPT-4o adds an invariant and an assertion (highlighted lines), resulting in a fully verified program. This new node can still be used further by other editors as the basis or larger, more complex programs, that can now call this method or extend it.