Table of Contents
Fetching ...

Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought

Zichen Xie, Wenxi Wang

Abstract

As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to the proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.

Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought

Abstract

As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass or fail outcomes for proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to the proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.
Paper Structure (27 sections, 1 equation, 10 figures, 5 tables)

This paper contains 27 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: (a). The program is about replacing the last element of the first vector (first) with all elements of the second (second). Executable Rust code, specifications, and proof hints are highlighted. The specifications define the precondition (requires) and postcondition (ensures), and verification relies on three parts of proof hints: two blocks of loop invariants and one assertion. (b) Shows the Z3 proof of the running example with high-level, medium-level, and low-level proofs highlighted. The high-level rule unit-resolution (line 491) establishes the property $|seq.subrange(s, j, k)| = k - j$, while medium-level rules such as monotonicity and rewrite (line 622) prove index bounds. Low-level rules encode trivial reasoning, e.g., refl (lines 653–654), which simply asserts $x775 = $x775.
  • Figure 2: Overview of VCoT-Lift.
  • Figure 3: The intermediate VCoT for the running example’s second while loop across stages of VCoT-Lift. ① Shows the Verus program with original proof hints and transformed proofs; after three transformer–checker loops, several assertions and a lemma establishing a key property s.subrange(0, s.len()) == s are introduced. ② Shows the pruned proofs. ③ Shows the repaired proofs, fixing a syntactic error.
  • Figure 4: Conceptual visualization of: ① Aggregated Z3 proof segments for the running example, where mp and lemma segments are grouped and passed to the corresponding checkers; ② the mapping mechanism, where the modus ponens checker identifies a Z3 proof without a corresponding Verus-level proof and marks the transformation incomplete.
  • Figure 5: Comparison of Total Proof Lines per Program Between VCoT-Bench-Org and Verus-Bench
  • ...and 5 more figures