Table of Contents
Fetching ...

VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

Xun Deng, Sicheng Zhong, Barış Bayazıt, Andreas Veneris, Fan Long, Xujie Si

TL;DR

This work addresses the lack of end-to-end verification benchmarks by introducing VerifyThisBench, which evaluates LLMs on translating natural-language problem statements into formal specifications, verification-ready code, and machine-checkable proofs. It provides VerifyThisBenchXS as a relaxed variant to study partial artifacts and refinement within a unified environment that integrates seven verification tools. The authors benchmark nine SOTA LLMs across 41 challenges (154 tasks) and 580 XS tasks, revealing consistently low pass rates in zero-shot settings but meaningful gains with iterative feedback, highlighting substantial gaps in current models' formal reasoning. The dataset, evaluation pipeline, and tool coverage aim to catalyze progress in LLM-assisted formal methods and trustworthy software generation.

Abstract

Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee on the trustworthiness of the generated programs. To improve visibility into model reasoning on formal correctness, we introduce VerifyThisBench, a new benchmark that evaluates end-to-end program verification from natural language descriptions: models must (i) extract formal specifications, (ii) implement in a verification-aware language, and (iii) construct machine checkable proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4%, with many outputs failing to compile. To isolate sources of difficulty, we further propose VerifyThisBenchXS, a relaxed variant in which partial implementations or proofs are provided. Across nine models and seven verification tools on both benchmarks, we observe consistent gains with feedback-driven refinement, but overall pass rates remain low, underscoring substantial gaps in formal reasoning. We release the benchmark and the unified evaluation environment to catalyze the verification capabilities for future models.

VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

TL;DR

This work addresses the lack of end-to-end verification benchmarks by introducing VerifyThisBench, which evaluates LLMs on translating natural-language problem statements into formal specifications, verification-ready code, and machine-checkable proofs. It provides VerifyThisBenchXS as a relaxed variant to study partial artifacts and refinement within a unified environment that integrates seven verification tools. The authors benchmark nine SOTA LLMs across 41 challenges (154 tasks) and 580 XS tasks, revealing consistently low pass rates in zero-shot settings but meaningful gains with iterative feedback, highlighting substantial gaps in current models' formal reasoning. The dataset, evaluation pipeline, and tool coverage aim to catalyze progress in LLM-assisted formal methods and trustworthy software generation.

Abstract

Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee on the trustworthiness of the generated programs. To improve visibility into model reasoning on formal correctness, we introduce VerifyThisBench, a new benchmark that evaluates end-to-end program verification from natural language descriptions: models must (i) extract formal specifications, (ii) implement in a verification-aware language, and (iii) construct machine checkable proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4%, with many outputs failing to compile. To isolate sources of difficulty, we further propose VerifyThisBenchXS, a relaxed variant in which partial implementations or proofs are provided. Across nine models and seven verification tools on both benchmarks, we observe consistent gains with feedback-driven refinement, but overall pass rates remain low, underscoring substantial gaps in formal reasoning. We release the benchmark and the unified evaluation environment to catalyze the verification capabilities for future models.

Paper Structure

This paper contains 29 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 2: zero-shot on VerifyThisBench
  • Figure 3: refinement on VerifyThisBench
  • Figure 4: zero-shot on VerifyThisBenchXS
  • Figure 5: refinement on VerifyThisBenchXS
  • Figure 6: Distributions of dataset characteristics. (Left) Word count distribution of natural language descriptions for challenges. (Right) Line-of-code distribution of collected ground-truth solutions.
  • ...and 1 more figures