CLEVER: A Curated Benchmark for Formally Verified Code Generation

Amitayush Thakur; Jasper Lee; George Tsoukalas; Meghana Sistla; Matthew Zhao; Stefan Zetzsche; Greg Durrett; Yisong Yue; Swarat Chaudhuri

CLEVER: A Curated Benchmark for Formally Verified Code Generation

Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri

TL;DR

Clever presents a curated Lean 4 benchmark with 161 end-to-end verified-code-generation tasks derived from HumanEval, enforcing completeness through non-computable specifications to prevent leakage. The evaluation adopts a staged pipeline—specification certification followed by implementation certification—with Lean's type checker serving as the arbiter of correctness, and reports that contemporary LLMs struggle to achieve end-to-end verification. By isolating spec generation from implementation correctness, Clever enables fine-grained analysis of where models fail and motivates the development of neural-symbolic verification strategies. The benchmark thus sets a high bar for verified code synthesis and provides a reproducible platform for advancing formal reasoning in AI systems.

Abstract

We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).

CLEVER: A Curated Benchmark for Formally Verified Code Generation

TL;DR

Abstract

We introduce

, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks,

avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use

to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(https://github.com/trishullab/clever) as well as HuggingFace(https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(https://github.com/trishullab/clever-prover).

CLEVER: A Curated Benchmark for Formally Verified Code Generation

TL;DR

Abstract

CLEVER: A Curated Benchmark for Formally Verified Code Generation

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (22)