Table of Contents
Fetching ...

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora

TL;DR

AlgoVeri presents the first strictly aligned, multi-language benchmark for verified code generation across Dafny, Verus, and Lean, targeting 77 classical algorithms to disentangle problem difficulty from tool-chain effects. Using expert-curated specifications, formal well-formedness checks, and a multi-turn refinement pipeline with a semantic validator, the study reveals a clear language hierarchy: SMT-based Dafny currently affords the most automation, while Verus and Lean impose substantial syntactic and proof-strategy barriers. Frontier models can leverage iterative repair to improve verification outcomes (e.g., tripling pass rates in some languages), but open-weight models show early saturation and exhibit an algorithmic fidelity gap, including tendencies toward cheating or degenerating to simpler variants. The results argue for heterogeneity in verification approaches to push vericoding forward and establish AlgoVeri as a standardized, rigorous platform for cross-language evaluation and future progress in formal code generation. This benchmark thus has practical implications for building reliable, verified software through collaborative, multi-paradigm verification ecosystems.

Abstract

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.

AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

TL;DR

AlgoVeri presents the first strictly aligned, multi-language benchmark for verified code generation across Dafny, Verus, and Lean, targeting 77 classical algorithms to disentangle problem difficulty from tool-chain effects. Using expert-curated specifications, formal well-formedness checks, and a multi-turn refinement pipeline with a semantic validator, the study reveals a clear language hierarchy: SMT-based Dafny currently affords the most automation, while Verus and Lean impose substantial syntactic and proof-strategy barriers. Frontier models can leverage iterative repair to improve verification outcomes (e.g., tripling pass rates in some languages), but open-weight models show early saturation and exhibit an algorithmic fidelity gap, including tendencies toward cheating or degenerating to simpler variants. The results argue for heterogeneity in verification approaches to push vericoding forward and establish AlgoVeri as a standardized, rigorous platform for cross-language evaluation and future progress in formal code generation. This benchmark thus has practical implications for building reliable, verified software through collaborative, multi-paradigm verification ecosystems.

Abstract

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
Paper Structure (40 sections, 6 figures, 3 tables)

This paper contains 40 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (AlgoVeri benchmark pipeline.) The "textbook" algorithms and data structure with their pre and post-conditions are formalized into Dafny, Verus, and Lean. Then, the LLM being tested is prompted to implement and verify the specific algorithm or data structure operation, through an iterative refinement process with the compiler. If the generated code (implementation and proof) is successfully verified, an LLM judge check if the code implements the designated algorithm, serving as a semantic validator.
  • Figure 2: The composition of AlgoVeri Benchmark
  • Figure 3: Performance breakdown by algorithm category. The radar charts illustrate the "Complexity Cliff" in current model capabilities. While models achieve high proficiency in Basic Data Structures and Sorting (approaching the outer rim in Dafny), capability degrades sharply for Advanced Data Structures and Graph Algorithms (shrinking toward the center). This trend is most pronounced in Verus and Lean, where the compounding difficulty of global invariants and strict system constraints (ownership or constructive logic) renders high-complexity tasks nearly unsolvable.
  • Figure 4: Average Pass Rate across 14 Repair Rounds. We compare Gemini-3 Flash (solid) and GPT-OSS-120B (dashed). The frontier model (Gemini) achieves consistent improvement in both Dafny and Verus as repair depth increases, effectively converting test-time compute into verification success. In contrast, the open model (GPT-OSS) saturates early across all languages, showing minimal gains beyond Round 4.
  • Figure 5: The Efficiency Frontier of Iterative Repair (GPT-OSS-120B). We compare the efficacy of Depth (repair) versus Width (parallel sampling) under a strict iso-compute constraint (equal total generation budget). The Grey Dashed Line represents the pure sampling baseline. Observation: Across all languages, repair strategies (solid lines) fail to significantly outperform the sampling baseline, and deep repair (Blue, 15 Rounds) often degrades performance. This confirms that for current open models, repair is inefficient---allocating compute to depth yields no marginal gain over simple parallel sampling.
  • ...and 1 more figures