Table of Contents
Fetching ...

CLEVER: A Curated Benchmark for Formally Verified Code Generation

Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri

TL;DR

Clever presents a curated Lean 4 benchmark with 161 end-to-end verified-code-generation tasks derived from HumanEval, enforcing completeness through non-computable specifications to prevent leakage. The evaluation adopts a staged pipeline—specification certification followed by implementation certification—with Lean's type checker serving as the arbiter of correctness, and reports that contemporary LLMs struggle to achieve end-to-end verification. By isolating spec generation from implementation correctness, Clever enables fine-grained analysis of where models fail and motivates the development of neural-symbolic verification strategies. The benchmark thus sets a high bar for verified code synthesis and provides a reproducible platform for advancing formal reasoning in AI systems.

Abstract

We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).

CLEVER: A Curated Benchmark for Formally Verified Code Generation

TL;DR

Clever presents a curated Lean 4 benchmark with 161 end-to-end verified-code-generation tasks derived from HumanEval, enforcing completeness through non-computable specifications to prevent leakage. The evaluation adopts a staged pipeline—specification certification followed by implementation certification—with Lean's type checker serving as the arbiter of correctness, and reports that contemporary LLMs struggle to achieve end-to-end verification. By isolating spec generation from implementation correctness, Clever enables fine-grained analysis of where models fail and motivates the development of neural-symbolic verification strategies. The benchmark thus sets a high bar for verified code synthesis and provides a reproducible platform for advancing formal reasoning in AI systems.

Abstract

We introduce , a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).

Paper Structure

This paper contains 16 sections, 22 figures, 2 tables.

Figures (22)

  • Figure 1: The two tasks of the Clever benchmark pipeline. Task 1 requires first generating a specification $\psi$ from the natural language statement $\nu$, then proving an isomorphism between the generated specification and a human-written specification $\psi^*$. Task 2 requires first generating a Lean implementation $\pi$, then proving its correctness according to the human-written specification. Both of these tasks must be completed correctly (reaching both QED 1 and QED 2) in order for a success to be counted.
  • Figure 2: Two different specs for finding the $n^{th}$ Fibonacci number. (a) shows a computable specification that leaks the implementation; (b) shows a non-computable specification leading to no-leakage of the implementation and enforcing the model to learn the deeper logical inference.
  • Figure 3: Illustration of specification leakage (left) and its mitigation (right) via non-computable specifications, using HumanEval problem 72. The task is to return true iff a list q is a palindrome and its sum is at most w. In (a–c), the spec is computable: it encodes the desired logic in a Boolean expression, allowing the model to copy it directly in (b) and produce a trivial proof (c) via just unfolding and simplifying basic definitions used in the theorem statement. In contrast, (d–f) use a non-computable spec expressed in Prop with logical implications. The corresponding implementation (e), generated by GPT-4o using few-shot prompting, reflects the semantic intent without mirroring the spec. The proof (f) fails without additional reasoning, highlighting the challenge of proving correctness when logic cannot be mechanically unfolded. Non-computable specs thus act as guardrails, requiring models to reason rather than copy.
  • Figure 4: Polynomial Root-Finding. Problem 32 asks for an approximate real root of a degree-$n$ polynomial. The spec enforces small residual error ($<10^{-6}$). The implementation uses Newton's method with bounded recursion; proving termination is non-trivial due to lack of guaranteed derivative behavior.
  • Figure 5: Evaluation strategy: retry each generation step until Lean compilation succeeds or a timeout is reached.
  • ...and 17 more figures