Table of Contents
Fetching ...

DafnyBench: A Benchmark for Formal Software Verification

Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark

TL;DR

DafnyBench presents the largest ML-driven benchmark for formal software verification, focusing on auto-generating verification hints to drive the Dafny verifier across 782 ground-truth programs totaling about $53{,}000$ lines of code. The study empirically evaluates multiple LLMs, finding Claude 3 Opus as the top performer with around $67.8\%$ success, while showing that error-message feedback yields diminishing returns and that verification difficulty grows with program size and hint quantity. The work highlights the potential of LLM-assisted verifiers and auto-synthesized proofs, while acknowledging limitations such as data contamination and the scope of evaluation. These insights point to a path toward broader benchmarks, improved LLM-in-the-loop verifiers, and automated synthesis workflows that could substantially lower the cost and increase the adoption of formal verification in software development.

Abstract

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.

DafnyBench: A Benchmark for Formal Software Verification

TL;DR

DafnyBench presents the largest ML-driven benchmark for formal software verification, focusing on auto-generating verification hints to drive the Dafny verifier across 782 ground-truth programs totaling about lines of code. The study empirically evaluates multiple LLMs, finding Claude 3 Opus as the top performer with around success, while showing that error-message feedback yields diminishing returns and that verification difficulty grows with program size and hint quantity. The work highlights the potential of LLM-assisted verifiers and auto-synthesized proofs, while acknowledging limitations such as data contamination and the scope of evaluation. These insights point to a path toward broader benchmarks, improved LLM-in-the-loop verifiers, and automated synthesis workflows that could substantially lower the cost and increase the adoption of formal verification in software development.

Abstract

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.
Paper Structure (33 sections, 2 equations, 12 figures, 6 tables)

This paper contains 33 sections, 2 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Overview of evaluating LLM on a DafnyBench test program.
  • Figure 2: An example ground_truth program that is fully verified with Dafny. To create the fill_hints task, we would remove the invariant lines from the program above.
  • Figure 3: Success rate vs. number of attempts given.
  • Figure 4: Mean success rate of each bin vs. program length (a), and mean success rate of each bin vs. hint quantity (b). The vertical lines indicate the bin boundaries used, where the bins have an almost uniform distribution of the programs. Note that the bins are different for the two metrics. For better visual clarity, the scales are adjusted for both plots and their $x$-axes do not start at $0$ character.
  • Figure 5: Pseudocode for the minhash deduplication algorithm.
  • ...and 7 more figures