FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, Xiaodan Liang
TL;DR
This work introduces FVEL, an interactive formal verification environment that bridges Isabelle's rigorous formal libraries with large language models to perform theorem proving on translated code. By converting C code into Isabelle formulations and leveraging LLMs to generate lemmas and proofs while receiving real-time prover feedback, FVEL enables a neural ATP-assisted verification workflow. To support this, the authors construct FVELer, a large-scale Isabelle dataset with deep theory dependencies and step-wise lemmas (758 theories, 29,125 lemmas, 200,646 proof steps), and demonstrate tunable performance improvements on Code2Inv and SV-COMP after fine-tuning LLMs (e.g., Llama-3-8B up to 17.39% and Mistral-7B up to 12% on SV-COMP). The results indicate the feasibility and benefits of integrating LLM-based reasoning with formal verification environments, while also highlighting challenges such as data scarcity and the need for better tactic-style proof handling in Isabelle-based workflows. The work suggests significant potential for advancing reliable software verification by combining rich formal libraries with neural theorem proving, and maps a clear path for extending FVEL to additional languages and richer verification tasks.
Abstract
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
