Clover: Closed-Loop Verifiable Code Generation

Chuyue Sun; Ying Sheng; Oded Padon; Clark Barrett

Clover: Closed-Loop Verifiable Code Generation

Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett

TL;DR

Clover tackles the reliability gap in AI-generated code by pairing generation with a rigorous verification phase built on six cross-component consistency checks among code, formal annotations, and docstrings. By reconciling generation with deductive verification (via Dafny) and reconstruction-based testing (driven by GPT-4), Clover demonstrates high acceptance rates for correct artifacts ($ ext{up to }$ $87$% on ground-truth with multiple reconstruction attempts) and zero false positives on adversarial cases, plus meaningful improvements on external datasets. The work introduces the CloverBench dataset of annotated Dafny programs to benchmark this closed-loop approach and provides preliminary evidence that the paradigm generalizes to other verifiers (e.g., Verus). Overall, Clover offers a scalable path toward automatically generating formally verified code, with practical implications for safer AI-assisted software development.

Abstract

The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to undesirable outcomes. In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code. Clover performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its performance on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset: (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for adversarial incorrect ones (no false positives). Clover also discovered 6 incorrect programs in the existing human-written dataset MBPP-DFY-50.

Clover: Closed-Loop Verifiable Code Generation

TL;DR

% on ground-truth with multiple reconstruction attempts) and zero false positives on adversarial cases, plus meaningful improvements on external datasets. The work introduces the CloverBench dataset of annotated Dafny programs to benchmark this closed-loop approach and provides preliminary evidence that the paradigm generalizes to other verifiers (e.g., Verus). Overall, Clover offers a scalable path toward automatically generating formally verified code, with practical implications for safer AI-assisted software development.

Abstract

Paper Structure (34 sections, 1 theorem, 3 figures, 23 tables, 1 algorithm)

This paper contains 34 sections, 1 theorem, 3 figures, 23 tables, 1 algorithm.

Introduction
Preliminaries: Deductive Program Verification
Clover
Clover Generation Phase
Clover Verification Phase
Consistency Checking Example
Evaluation
Dataset: CloverBench
Dafny
Generation Phase
Verification Phase: Results on Ground-Truth Examples
Verification Phase: Results on MBPP-DFY-50
Verification Phase: Results on Adversarial Examples
A Preliminary Study with Verus
Related Work
...and 19 more sections

Key Result

theorem 3

Under Assumptions asmp:gtconsistency (Consistency) and asmp:transfer-function (Concentration), consider $(x, y)$ sampled from $A\times B$ according to $\mathcal{D}$ conditioned on $(x, y)\in G$; the single-edge Clover consistency check will accept $(x, y)$ with probability $A\geq l \cdot p_c \cdot

Figures (3)

Figure 1: The Clover paradigm
Figure 2: Generation phase feasibility study
Figure : Clover Consistency Check ($k=1$)

Theorems & Definitions (4)

definition thmcounterdefinition: Transfer Model
definition thmcounterdefinition: Transfer-Rational Model
definition thmcounterdefinition: Single-Edge Clover Consistency Check
theorem 3

Clover: Closed-Loop Verifiable Code Generation

TL;DR

Abstract

Clover: Closed-Loop Verifiable Code Generation

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (3)

Theorems & Definitions (4)