Clover: Closed-Loop Verifiable Code Generation
Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett
TL;DR
Clover tackles the reliability gap in AI-generated code by pairing generation with a rigorous verification phase built on six cross-component consistency checks among code, formal annotations, and docstrings. By reconciling generation with deductive verification (via Dafny) and reconstruction-based testing (driven by GPT-4), Clover demonstrates high acceptance rates for correct artifacts ($ ext{up to }$ $87$% on ground-truth with multiple reconstruction attempts) and zero false positives on adversarial cases, plus meaningful improvements on external datasets. The work introduces the CloverBench dataset of annotated Dafny programs to benchmark this closed-loop approach and provides preliminary evidence that the paradigm generalizes to other verifiers (e.g., Verus). Overall, Clover offers a scalable path toward automatically generating formally verified code, with practical implications for safer AI-assisted software development.
Abstract
The use of large language models for code generation is a rapidly growing trend in software development. However, without effective methods for ensuring the correctness of generated code, this trend could lead to undesirable outcomes. In this paper, we introduce a new approach for addressing this challenge: the Clover paradigm, short for Closed-Loop Verifiable Code Generation, which uses consistency checking to provide a strong filter for incorrect code. Clover performs consistency checks among code, docstrings, and formal annotations. The checker is implemented using a novel integration of formal verification tools and large language models. We provide a theoretical analysis to support our thesis that Clover should be effective at consistency checking. We also empirically investigate its performance on a hand-designed dataset (CloverBench) featuring annotated Dafny programs at a textbook level of difficulty. Experimental results show that for this dataset: (i) LLMs are reasonably successful at automatically generating formal specifications; and (ii) our consistency checker achieves a promising acceptance rate (up to 87%) for correct instances while maintaining zero tolerance for adversarial incorrect ones (no false positives). Clover also discovered 6 incorrect programs in the existing human-written dataset MBPP-DFY-50.
