Table of Contents
Fetching ...

Automated Proof Generation for Rust Code via Self-Evolution

Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu K Lahiri, Tao Xie, Lidong Zhou

TL;DR

The paper tackles the problem of automating formal proofs for Rust code in the face of scarce human-proof data. It introduces SAFE, a self-evolving framework that couples data synthesis, model fine-tuning, and a symbolic verifier (Verus) to generate and validate proofs, while repurposing incorrect proofs to train self-debugging. SAFE bootstraps with GPT-4o to translate and generate Verus-compatible code and specifications, then iteratively refines open-source models through two intertwined tasks: specification synthesis and proof synthesis, guided by correctness and completeness metrics. Empirical results on VerusBench and CodeNet-Test show that SAFE significantly outperforms GPT-4o-based baselines, with self-debugging further boosting accuracy to the high-70s/low-80s on VerusBench and approaching 50% on CodeNet-Test, demonstrating the practical potential of automated, data-driven formal verification for Rust. The work highlights the importance of specification quality and demonstrates that a self-evolving data loop can substantially improve open-model capabilities for complex verification tasks.

Abstract

Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but demands substantial human effort in proof construction and hence raises a pressing need for automation. The primary obstacle lies in the severe lack of data-there is much fewer proofs than code snippets for Large Language Models (LLMs) to train upon. In this paper, we introduce SAFE, a framework that overcomes the lack of human-written proofs to enable automated proof generation of Rust code. SAFE establishes a self-evolving cycle where data synthesis and fine-tuning collaborate to enhance the model capability, leveraging the definitive power of a symbolic verifier in telling correct proofs from incorrect ones. SAFE also re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to fix incorrect proofs based on the verifier's feedback. SAFE demonstrates superior efficiency and precision compared to GPT-4o. Through tens of thousands of synthesized proofs and the self-debugging mechanism, we improve the capability of open-source models, initially unacquainted with formal verification, to automatically write proofs for Rust code. This advancement leads to a significant improvement in performance, achieving a 52.52% accuracy rate in a benchmark crafted by human experts, a significant leap over GPT-4o's performance of 14.39%.

Automated Proof Generation for Rust Code via Self-Evolution

TL;DR

The paper tackles the problem of automating formal proofs for Rust code in the face of scarce human-proof data. It introduces SAFE, a self-evolving framework that couples data synthesis, model fine-tuning, and a symbolic verifier (Verus) to generate and validate proofs, while repurposing incorrect proofs to train self-debugging. SAFE bootstraps with GPT-4o to translate and generate Verus-compatible code and specifications, then iteratively refines open-source models through two intertwined tasks: specification synthesis and proof synthesis, guided by correctness and completeness metrics. Empirical results on VerusBench and CodeNet-Test show that SAFE significantly outperforms GPT-4o-based baselines, with self-debugging further boosting accuracy to the high-70s/low-80s on VerusBench and approaching 50% on CodeNet-Test, demonstrating the practical potential of automated, data-driven formal verification for Rust. The work highlights the importance of specification quality and demonstrates that a self-evolving data loop can substantially improve open-model capabilities for complex verification tasks.

Abstract

Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but demands substantial human effort in proof construction and hence raises a pressing need for automation. The primary obstacle lies in the severe lack of data-there is much fewer proofs than code snippets for Large Language Models (LLMs) to train upon. In this paper, we introduce SAFE, a framework that overcomes the lack of human-written proofs to enable automated proof generation of Rust code. SAFE establishes a self-evolving cycle where data synthesis and fine-tuning collaborate to enhance the model capability, leveraging the definitive power of a symbolic verifier in telling correct proofs from incorrect ones. SAFE also re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to fix incorrect proofs based on the verifier's feedback. SAFE demonstrates superior efficiency and precision compared to GPT-4o. Through tens of thousands of synthesized proofs and the self-debugging mechanism, we improve the capability of open-source models, initially unacquainted with formal verification, to automatically write proofs for Rust code. This advancement leads to a significant improvement in performance, achieving a 52.52% accuracy rate in a benchmark crafted by human experts, a significant leap over GPT-4o's performance of 14.39%.

Paper Structure

This paper contains 37 sections, 3 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: The SAFE Framework
  • Figure 2: LLaMa
  • Figure 3: DeepSeekCoder
  • Figure 5: The distribution of the number of tokens in preconditions
  • Figure 6: The distribution of the number of tokens in postconditions
  • ...and 2 more figures