Table of Contents
Fetching ...

Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set

Nongyu Di, Tianyu Chen, Shan Lu, Shuai Lu, Yeyun Gong, Peng Cheng, Jacob R. Lorch, Yuan Yao, Xiaoxing Ma

TL;DR

VeruSyn presents a scalable data-synthesis pipeline to reduce proof synthesis costs for Verus-backed Rust verification. By combining self-synthesis, tutorial-based data augmentation, and agent-trajectories with long CoT data, it creates a 6.9 million-verification program corpus plus 4,557 CoT traces, enabling a compact model (Qwen2.5-Coder-32B-Instruct) to approach the proof-generation performance of large models at a fraction of the cost. The approach yields substantial gains in Verus-feature coverage and proof accuracy on real-world benchmarks (VerusBench and VeruSAGE-Bench) relative to baselines, while maintaining a strong cost advantage over commercial models. These results demonstrate the practicality of training smaller LLMs for formal verification tasks and offer a scalable pathway for broader adoption of verified Rust software. The work highlights the importance of diverse verification features and reasoning trajectories in training data for formal-methods-aware LLMs.

Abstract

Large Language Models (LLMs) are widely used for code generation. However, the correctness of code generated by LLMs remains a concern. A potential remedy to this concern is to have LLMs generate formal correctness proofs along with such code. However, compared with code generation, code-proof generation requires much higher reasoning capability and has much less existing data to learn from. In this paper, we present VeruSyn, a data synthesis pipeline for Verus, a state-of-the-art verification tool for system software written in Rust. Through self-synthesis and tutorial-based synthesis, VeruSyn achieves much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus; VeruSyn also supplements its dataset with long-chain-of-thought (CoT) data through agent trajectory synthesis. With VeruSyn, we synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets us create a fine-tuned Qwen2.5-Coder-32B-Instruct model with appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5. It also significantly outperforms models like o4-mini and previously proposed research models.

Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set

TL;DR

VeruSyn presents a scalable data-synthesis pipeline to reduce proof synthesis costs for Verus-backed Rust verification. By combining self-synthesis, tutorial-based data augmentation, and agent-trajectories with long CoT data, it creates a 6.9 million-verification program corpus plus 4,557 CoT traces, enabling a compact model (Qwen2.5-Coder-32B-Instruct) to approach the proof-generation performance of large models at a fraction of the cost. The approach yields substantial gains in Verus-feature coverage and proof accuracy on real-world benchmarks (VerusBench and VeruSAGE-Bench) relative to baselines, while maintaining a strong cost advantage over commercial models. These results demonstrate the practicality of training smaller LLMs for formal verification tasks and offer a scalable pathway for broader adoption of verified Rust software. The work highlights the importance of diverse verification features and reasoning trajectories in training data for formal-methods-aware LLMs.

Abstract

Large Language Models (LLMs) are widely used for code generation. However, the correctness of code generated by LLMs remains a concern. A potential remedy to this concern is to have LLMs generate formal correctness proofs along with such code. However, compared with code generation, code-proof generation requires much higher reasoning capability and has much less existing data to learn from. In this paper, we present VeruSyn, a data synthesis pipeline for Verus, a state-of-the-art verification tool for system software written in Rust. Through self-synthesis and tutorial-based synthesis, VeruSyn achieves much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus; VeruSyn also supplements its dataset with long-chain-of-thought (CoT) data through agent trajectory synthesis. With VeruSyn, we synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets us create a fine-tuned Qwen2.5-Coder-32B-Instruct model with appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5. It also significantly outperforms models like o4-mini and previously proposed research models.
Paper Structure (27 sections, 11 figures, 9 tables, 1 algorithm)

This paper contains 27 sections, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: Keyword Distribution of SAFE Dataset.
  • Figure 2: Keyword Distribution of AlphaVerus Dataset.
  • Figure 3: Keyword Distribution of VeruSAGE-Bench.
  • Figure 5: Overview of Our Approach
  • Figure 6: Keyword Distribution of VeruSyn's Part-1 Data. Even the least frequent keyword appears 8.6K times.
  • ...and 6 more figures