Evaluating LLM-Generated ACSL Annotations for Formal Verification

Arshad Beg; Diarmuid O'Donoghue; Rosemary Monahan

Evaluating LLM-Generated ACSL Annotations for Formal Verification

Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan

TL;DR

This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance, and provides new empirical evidence on the capabilities and limitations of automated ACSL generation.

Abstract

Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models--DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-based work.

Evaluating LLM-Generated ACSL Annotations for Formal Verification

TL;DR

Abstract

Paper Structure (12 sections, 8 figures, 7 tables)

This paper contains 12 sections, 8 figures, 7 tables.

Introduction
Related Work
Experimental Setup and Workflow
Empirical Evaluation
EVA Static Analysis
Analysis based on Mean Values
Rule-Based Python Script Generation
Frama-C RTE Generation
GPT-5.2 Generation
DeepSeek Generation
OLMo3 Generation
Conclusions

Figures (8)

Figure 1: End-to-end research workflow illustrating how a curated dataset is progressively transformed into pure C files, enriched with ACSL annotations generated via a combination of automated scripts, Frama-C RTE, and large language models (DeepSeek-V3.2, GPT-5.2, and OLMo-3.1 32B Instruct), then consolidated into annotated C files that are evaluated using weakest-precondition (WP) prover tests to produce verification results that are finally subjected to systematic analysis.
Figure 2: Visual summary of EVA analysis results. (a) Number of alarms per file. (b) Distribution of statement coverage for successful and failed analyses. (c) Relationship between kernel warnings and coverage.
Figure 3: Comparison of WP verification outcomes across ACSL generation methods. The figure contrasts proof success, solver robustness, and their interaction for tool-generated and LLM-generated specifications.
Figure 4: Solver Performance for Rule-based Python Script
Figure 5: Solver Performance for RTE-Generated ACSL
...and 3 more figures

Evaluating LLM-Generated ACSL Annotations for Formal Verification

TL;DR

Abstract

Evaluating LLM-Generated ACSL Annotations for Formal Verification

Authors

TL;DR

Abstract

Table of Contents

Figures (8)