Can LLMs Enable Verification in Mainstream Programming?
Aleksandr Shefer, Igor Engel, Stanislav Alekseev, Daniil Berezun, Ekaterina Verbitskaia, Anton Podkopaev
TL;DR
This work addresses the challenge of bringing formal verification into mainstream software development by evaluating whether large language models can generate correct-by-construction code for three verification backends: Dafny, Nagini, and Verus. It introduces a universal, prompt-driven pipeline with task preparation, iterative LLM interaction, and SMT-based validation, evaluated on manually curated HumanEval-derived datasets. Results show that Dafny yields the highest verified-code rates, with Nagini performing well in several modes and Verus lagging due to expressiveness; a five-iteration refinement strategy improves outcomes, though mode6 remains difficult. The findings suggest a viable path toward wider adoption of formal methods in general programming and point to future work in targeted fine-tuning, error-correction mechanisms, and broader language support.
Abstract
Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-ofthe-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.
