Table of Contents
Fetching ...

VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

Lingfei Zeng, Fengdi Che, Xuhan Huang, Fei Ye, Xu Xu, Binhang Yuan, Jie Fu

TL;DR

VeriEquivBench addresses the bottleneck of ground-truth specification evaluation by introducing a formally grounded equivalence score that checks bidirectional code–spec alignment using the Dafny verifier, enabling ground-truth-free evaluation. It scales evaluation to 2,389 complex problems by autoformalizing LeetCode tasks and generating novel queries through a structured tag composition pipeline. The framework combines a NL-based equivalence check with a back-translation validation step, revealing that end-to-end verifiable code generation remains a major challenge for current LLMs despite improvements in NL translation. Empirically, state-of-the-art models achieve limited end-to-end success on VeriEquivBench, underscoring the need for scalable data generation and robust auto-evaluation to advance trustworthy, formally verifiable coding agents.

Abstract

Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with $2,389$ complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.

VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

TL;DR

VeriEquivBench addresses the bottleneck of ground-truth specification evaluation by introducing a formally grounded equivalence score that checks bidirectional code–spec alignment using the Dafny verifier, enabling ground-truth-free evaluation. It scales evaluation to 2,389 complex problems by autoformalizing LeetCode tasks and generating novel queries through a structured tag composition pipeline. The framework combines a NL-based equivalence check with a back-translation validation step, revealing that end-to-end verifiable code generation remains a major challenge for current LLMs despite improvements in NL translation. Empirically, state-of-the-art models achieve limited end-to-end success on VeriEquivBench, underscoring the need for scalable data generation and robust auto-evaluation to advance trustworthy, formally verifiable coding agents.

Abstract

Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove alignment with user intent, progress is bottlenecked by specification quality evaluation. Current benchmarks rely on matching against ground-truth specifications, a manual and expertise-intensive process that has limited existing datasets to a few hundred simple problems and also suffers from a reliability issue. To address this, we introduce VeriEquivBench, a new benchmark with complex algorithmic problems that probe the limitations of current models in both code generation and formal reasoning. Our evaluation framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs. This underscores both the difficulty of the task and the need for benchmarks like VeriEquivBench to drive progress toward scalable and reliable coding agents.

Paper Structure

This paper contains 30 sections, 23 figures, 7 tables.

Figures (23)

  • Figure 1: An end-to-end verifiable coding agent first generates code and specifications, using the Dafny verifier to prove their mutual equivalence. It then translates the complete formal specification back into natural language, allowing the user to confirm that it aligns with their original intent.
  • Figure 2: The figure outlines our autoformalization and code generation workflow: Pipeline 1 produces comprehensive and syntax-free specifications; Pipeline 2 checks consistency between the NL query and the specifications; Pipeline 3 emits fully annotated code that passes the verifier.
  • Figure 3: We show an example where the equivalence score proves the given specifications are underspecified for returning the maximum between two integers. The code presents the statement to verify whether the specification implies the code.
  • Figure 4: We show an example where the equivalence score proves the given specifications are underspecified for returning the maximum between two integers. The code presents the statement to verify whether the specification implies the code.
  • Figure 5: An example of a weak specification in sample #625 that fails equivalence scoring. The formal specification is ambiguous as it omits a post-condition on the invariance of the array's length.
  • ...and 18 more figures