Table of Contents
Fetching ...

VERINA: Benchmarking Verifiable Code Generation

Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

TL;DR

Verina introduces a comprehensive benchmark for verifiable code generation in Lean, jointly evaluating code, specifications, and proofs across 189 manually curated tasks. By combining a modular data format, rigorous quality assurance, and a multi-stage evaluation pipeline, Verina enables robust measurement of soundness, completeness, and end-to-end verification. Experimental results reveal substantial gaps in current LLMs, particularly in ProofGen, with the best general-purpose model achieving only modest success and specialized provers offering improvements—especially when paired with iterative Lean verifier feedback. The benchmark’s compositional design, along with targeted experiments and explicit metric definitions, provides a principled framework to drive progress toward reliable, formally verified automated programming systems, and the authors release data and tooling to facilitate reproducibility and adoption.

Abstract

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4\% code correctness rate, 51.0\% for specification soundness and completeness, and a mere 3.6\% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.

VERINA: Benchmarking Verifiable Code Generation

TL;DR

Verina introduces a comprehensive benchmark for verifiable code generation in Lean, jointly evaluating code, specifications, and proofs across 189 manually curated tasks. By combining a modular data format, rigorous quality assurance, and a multi-stage evaluation pipeline, Verina enables robust measurement of soundness, completeness, and end-to-end verification. Experimental results reveal substantial gaps in current LLMs, particularly in ProofGen, with the best general-purpose model achieving only modest success and specialized provers offering improvements—especially when paired with iterative Lean verifier feedback. The benchmark’s compositional design, along with targeted experiments and explicit metric definitions, provides a principled framework to drive progress toward reliable, formally verified automated programming systems, and the authors release data and tooling to facilitate reproducibility and adoption.

Abstract

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, achieves a 61.4\% code correctness rate, 51.0\% for specification soundness and completeness, and a mere 3.6\% proof success rate (based on one trial per task). We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.

Paper Structure

This paper contains 20 sections, 25 figures, 8 tables.

Figures (25)

  • Figure 1: An example instance of Verina, consisting of a problem description, code implementation, specifications (pre-condition and post-condition), a proof (optional), and comprehensive test cases. Note that we select this instance for presentation purposes and Verina contains more difficult ones.
  • Figure 2: Verina's three foundational tasks. Dashed arrows represent optional inputs.
  • Figure 3: Our evaluator for specification generation.
  • Figure 4: Combinations of Verina's foundational tasks: specification-guided code generation (top left), specification inference from code (bottom left), and end-to-end verifiable code generation (right). Natural language descriptions and function signatures are omitted in the figure for brevity.
  • Figure 5: pass@$1$ performance of LLMs on Verina's three foundational tasks.
  • ...and 20 more figures