Table of Contents
Fetching ...

VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig

TL;DR

VeriSoftBench is introduced, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies and results show that success is strongly correlated with transitive repository dependence.

Abstract

Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.

VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

TL;DR

VeriSoftBench is introduced, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies and results show that success is strongly correlated with transitive repository dependence.

Abstract

Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.
Paper Structure (17 sections, 6 figures, 5 tables)

This paper contains 17 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Contextual dependencies comparison between (a) mathematical benchmark proofs, (b) lightweight verification tasks, and (c) repository-scale verification. PutnamBench proofs rely almost entirely on library (Mathlib) dependencies (purple). Verina introduces a small number of project-specific definitions (yellow) but remains largely library-driven. In contrast, proofs from a real-world ZK circuit DSL repository clean depend heavily on interconnected local abstractions, forming a dense repository-level dependency graph. Purple denotes library dependencies; yellow denotes local/repository dependencies.
  • Figure 2: Lean repository distribution across topics
  • Figure 3: An example task instance from our benchmark. The goal is to synthesize a proof for the target theorem cexec_to_reds, which relates two definitions of program execution in a formalized programming language. The figure illustrates the context that must be provided to or retrieved by the prover: Local File Context includes language syntax and semantic rules, while Repository Dependencies provides base types and generic lemmas. Ellipses ("...") indicate that each context source contains many additional definitions. Color annotations trace dependencies from the theorem statement to its relevant context, highlighting the challenge: the prover must identify relevant definitions from a large local file and connect domain-specific rules with generic library lemmas.
  • Figure 4: Evaluation Pipeline.
  • Figure 5: Although the target theorem is for 32-bit unsigned integers (U32), the context provides a structurally identical U64 soundness proof that factors bitwise equalities into a helper lemma and then applies it. Even though the actual soundness_to_u32 helper is not provided, this recurring abstraction pattern allows the model to anticipate its existence and shape and to replicate the same proof structure for the smaller word size.
  • ...and 1 more figures