Table of Contents
Fetching ...

FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, Xiaodan Liang

TL;DR

This work introduces FVEL, an interactive formal verification environment that bridges Isabelle's rigorous formal libraries with large language models to perform theorem proving on translated code. By converting C code into Isabelle formulations and leveraging LLMs to generate lemmas and proofs while receiving real-time prover feedback, FVEL enables a neural ATP-assisted verification workflow. To support this, the authors construct FVELer, a large-scale Isabelle dataset with deep theory dependencies and step-wise lemmas (758 theories, 29,125 lemmas, 200,646 proof steps), and demonstrate tunable performance improvements on Code2Inv and SV-COMP after fine-tuning LLMs (e.g., Llama-3-8B up to 17.39% and Mistral-7B up to 12% on SV-COMP). The results indicate the feasibility and benefits of integrating LLM-based reasoning with formal verification environments, while also highlighting challenges such as data scarcity and the need for better tactic-style proof handling in Isabelle-based workflows. The work suggests significant potential for advancing reliable software verification by combining rich formal libraries with neural theorem proving, and maps a clear path for extending FVEL to additional languages and richer verification tasks.

Abstract

Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.

FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

TL;DR

This work introduces FVEL, an interactive formal verification environment that bridges Isabelle's rigorous formal libraries with large language models to perform theorem proving on translated code. By converting C code into Isabelle formulations and leveraging LLMs to generate lemmas and proofs while receiving real-time prover feedback, FVEL enables a neural ATP-assisted verification workflow. To support this, the authors construct FVELer, a large-scale Isabelle dataset with deep theory dependencies and step-wise lemmas (758 theories, 29,125 lemmas, 200,646 proof steps), and demonstrate tunable performance improvements on Code2Inv and SV-COMP after fine-tuning LLMs (e.g., Llama-3-8B up to 17.39% and Mistral-7B up to 12% on SV-COMP). The results indicate the feasibility and benefits of integrating LLM-based reasoning with formal verification environments, while also highlighting challenges such as data scarcity and the need for better tactic-style proof handling in Isabelle-based workflows. The work suggests significant potential for advancing reliable software verification by combining rich formal libraries with neural theorem proving, and maps a clear path for extending FVEL to additional languages and richer verification tasks.

Abstract

Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
Paper Structure (43 sections, 4 figures, 8 tables)

This paper contains 43 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: FVEL workflow. FVEL takes a C code as input, parses it into Isabelle definition, and then conducts interactive formal proving with FVEL-LLM/human via outputting proof state and receiving generated proof.
  • Figure 2: (a) SeL4 ROOT file structure. It provides an example ROOT file content for the session Word_Lib. (b) Theory dependency graph. Each theory file is grouped by the Session. (c) Step-wise lemmas extraction.
  • Figure 3: The FVELer dependency distributions by theory and lemma, respectively.
  • Figure 4: The FVELer lemma distribution over step intervals. We adjust the range by setting the y-axis to a logarithmic scale.