Can ChatGPT support software verification?
Christian Janßen, Cedric Richter, Heike Wehrheim
TL;DR
The paper investigates whether large language models, specifically ChatGPT, can support formal software verification by generating loop invariants for C programs annotated in ACSL. It annotates 106 SV-COMP Loops tasks with invariants produced by ChatGPT and evaluates their validity with the interactive verifier Frama-C and usefulness with Frama-C SV and CPAchecker, revealing 75 valid invariants and substantial improvements in solvable tasks. The study highlights a copy-assertion heuristic and other patterns that yield useful invariants, while also identifying limitations such as hallucinations, input-length constraints, and the need for a cooperative loop among LLMs, validators, and verifiers. Overall, the work demonstrates a viable path to augment formal verification workflows with LLMs, while outlining concrete directions for improving reliability, interoperability, and practical impact.
Abstract
Large language models have become increasingly effective in software engineering tasks such as code generation, debugging and repair. Language models like ChatGPT can not only generate code, but also explain its inner workings and in particular its correctness. This raises the question whether we can utilize ChatGPT to support formal software verification. In this paper, we take some first steps towards answering this question. More specifically, we investigate whether ChatGPT can generate loop invariants. Loop invariant generation is a core task in software verification, and the generation of valid and useful invariants would likely help formal verifiers. To provide some first evidence on this hypothesis, we ask ChatGPT to annotate 106 C programs with loop invariants. We check validity and usefulness of the generated invariants by passing them to two verifiers, Frama-C and CPAchecker. Our evaluation shows that ChatGPT is able to produce valid and useful invariants allowing Frama-C to verify tasks that it could not solve before. Based on our initial insights, we propose ways of combining ChatGPT (or large language models in general) and software verifiers, and discuss current limitations and open issues.
