Table of Contents
Fetching ...

Clover: Closed-Loop Verifiable Code Generation

Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett

TL;DR

Clover tackles the reliability gap in AI-generated code by pairing generation with a rigorous verification phase built on six cross-component consistency checks among code, formal annotations, and docstrings. By reconciling generation with deductive verification (via Dafny) and reconstruction-based testing (driven by GPT-4), Clover demonstrates high acceptance rates for correct artifacts ($ ext{up to }$ $87$% on ground-truth with multiple reconstruction attempts) and zero false positives on adversarial cases, plus meaningful improvements on external datasets. The work introduces the CloverBench dataset of annotated Dafny programs to benchmark this closed-loop approach and provides preliminary evidence that the paradigm generalizes to other verifiers (e.g., Verus). Overall, Clover offers a scalable path toward automatically generating formally verified code, with practical implications for safer AI-assisted software development.

Abstract

The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to undesirable outcomes. In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code. Clover performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its performance on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset: (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for adversarial incorrect ones (no false positives). Clover also discovered 6 incorrect programs in the existing human-written dataset MBPP-DFY-50.

Clover: Closed-Loop Verifiable Code Generation

TL;DR

Clover tackles the reliability gap in AI-generated code by pairing generation with a rigorous verification phase built on six cross-component consistency checks among code, formal annotations, and docstrings. By reconciling generation with deductive verification (via Dafny) and reconstruction-based testing (driven by GPT-4), Clover demonstrates high acceptance rates for correct artifacts ( % on ground-truth with multiple reconstruction attempts) and zero false positives on adversarial cases, plus meaningful improvements on external datasets. The work introduces the CloverBench dataset of annotated Dafny programs to benchmark this closed-loop approach and provides preliminary evidence that the paradigm generalizes to other verifiers (e.g., Verus). Overall, Clover offers a scalable path toward automatically generating formally verified code, with practical implications for safer AI-assisted software development.

Abstract

The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to undesirable outcomes. In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code. Clover performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its performance on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset: (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for adversarial incorrect ones (no false positives). Clover also discovered 6 incorrect programs in the existing human-written dataset MBPP-DFY-50.
Paper Structure (34 sections, 1 theorem, 3 figures, 23 tables, 1 algorithm)

This paper contains 34 sections, 1 theorem, 3 figures, 23 tables, 1 algorithm.

Key Result

theorem 3

Under Assumptions asmp:gtconsistency (Consistency) and asmp:transfer-function (Concentration), consider $(x, y)$ sampled from $A\times B$ according to $\mathcal{D}$ conditioned on $(x, y)\in G$; the single-edge Clover consistency check will accept $(x, y)$ with probability $A\geq l \cdot p_c \cdot

Figures (3)

  • Figure 1: The Clover paradigm
  • Figure 2: Generation phase feasibility study
  • Figure : Clover Consistency Check ($k=1$)

Theorems & Definitions (4)

  • definition thmcounterdefinition: Transfer Model
  • definition thmcounterdefinition: Transfer-Rational Model
  • definition thmcounterdefinition: Single-Edge Clover Consistency Check
  • theorem 3