Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast
Wen Fan, Marilyn Rego, Xin Hu, Sanya Dod, Zhaorui Ni, Danning Xie, Jenna DiVincenzo, Lin Tan
TL;DR
This study evaluates whether GPT-4o can generate C code specifications in VeriFast that are verifiable within a separation-logic framework. Using 21 VeriFast benchmarks and three input types (NL, FB, FB+) plus Basic and Chain-of-Thought prompting, the authors perform qualitative analysis to assess functional preservation, verifiability, and conventionality. They find that GPT-4o largely preserves functional behavior but yields many errors that prevent automated verification, with only a small portion verifiable and those often containing redundancies. The work highlights the need for more effective prompting strategies, granularity, and potential feedback-guided approaches to improve LLM-assisted specification generation for heap-manipulating programs. Overall, the paper provides a baseline for future improvements in using LLMs to assist formal specification tasks in separation-logic-based verifiers like VeriFast.
Abstract
Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap manipulating programs using an ownership logic. LLMs have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can perform specification generation for specifications based in an ownership logic, such as separation logic. To address this gap, this paper explores OpenAI's GPT-4o model's effectiveness in generating specifications on C programs that are verifiable with VeriFast, a separation logic based static verifier. Our experiment employs three different types of user inputs as well as basic and Chain-of-Thought (CoT) prompting to assess GPT's capabilities. Our results indicate that the specifications generated by GPT-4o preserve functional behavior, but struggle to be verifiable. When the specifications are verifiable they contain redundancies. Future directions are discussed to improve the performance.
