Selene: Pioneering Automated Proof in Software Verification
Lichen Zhang, Shuai Lu, Nan Duan
TL;DR
Selene introduces a project-level automated proof benchmark for industrial-scale software verification, grounded in the seL4 microkernel and Isabelle-based verification. It provides a lightweight lemma-isolation framework and end-to-end proof generation pipeline that enables rapid evaluation of LLMs like GPT-3.5-turbo and GPT-4 on real-world dependencies. The study shows that GPT-4 can achieve meaningful proof success, especially for simpler, declarative lemmas, and demonstrates that augmentation strategies (similar lemmas, dependency facts, and error-guided fixes) can improve performance. Collectively, Selene demonstrates both the promise and the current limits of using LLMs to automate formal proofs in large-scale software verification, offering concrete directions for future research, including interactive proof state and advanced fact extraction.
Abstract
Ensuring correctness is a pivotal aspect of software engineering. Among the various strategies available, software verification offers a definitive assurance of correctness. Nevertheless, writing verification proofs is resource-intensive and manpower-consuming, and there is a great need to automate this process. We introduce Selene in this paper, which is the first project-level automated proof benchmark constructed based on the real-world industrial-level operating system microkernel, seL4. Selene provides a comprehensive framework for end-to-end proof generation and a lightweight verification environment. Our experimental results with advanced large language models (LLMs), such as GPT-3.5-turbo and GPT-4, highlight the capabilities of LLMs in the domain of automated proof generation. Additionally, our further proposed augmentations indicate that the challenges presented by Selene can be mitigated in future research endeavors.
