Table of Contents
Fetching ...

Can LLMs Enable Verification in Mainstream Programming?

Aleksandr Shefer, Igor Engel, Stanislav Alekseev, Daniil Berezun, Ekaterina Verbitskaia, Anton Podkopaev

TL;DR

This work addresses the challenge of bringing formal verification into mainstream software development by evaluating whether large language models can generate correct-by-construction code for three verification backends: Dafny, Nagini, and Verus. It introduces a universal, prompt-driven pipeline with task preparation, iterative LLM interaction, and SMT-based validation, evaluated on manually curated HumanEval-derived datasets. Results show that Dafny yields the highest verified-code rates, with Nagini performing well in several modes and Verus lagging due to expressiveness; a five-iteration refinement strategy improves outcomes, though mode6 remains difficult. The findings suggest a viable path toward wider adoption of formal methods in general programming and point to future work in targeted fine-tuning, error-correction mechanisms, and broader language support.

Abstract

Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-ofthe-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.

Can LLMs Enable Verification in Mainstream Programming?

TL;DR

This work addresses the challenge of bringing formal verification into mainstream software development by evaluating whether large language models can generate correct-by-construction code for three verification backends: Dafny, Nagini, and Verus. It introduces a universal, prompt-driven pipeline with task preparation, iterative LLM interaction, and SMT-based validation, evaluated on manually curated HumanEval-derived datasets. Results show that Dafny yields the highest verified-code rates, with Nagini performing well in several modes and Verus lagging due to expressiveness; a five-iteration refinement strategy improves outcomes, though mode6 remains difficult. The findings suggest a viable path toward wider adoption of formal methods in general programming and point to future work in targeted fine-tuning, error-correction mechanisms, and broader language support.

Abstract

Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-ofthe-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.

Paper Structure

This paper contains 9 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The overview of evaluating an LLM on a benchmark problem
  • Figure 2: The dark-yellow background highlights the parts of Dafny code required to be filled by an LLM in different modes.
  • Figure 3: The general structure of a prompt
  • Figure 4: Venn Diagrams showing the intersection of unique programs successfully generated in at least one of five attempts
  • Figure 5: The example of Verus code, causing the type error