Table of Contents
Fetching ...

Verifying LLM-Generated Code in the Context of Software Verification with Ada/SPARK

Marcos Cramer, Lucian McIntyre

TL;DR

This work tackles the challenge of trusting LLM-generated code by pairing it with formal verification in the SPARK/Ada ecosystem. It presents Marmaragan, a prototype that uses an LLM to generate SPARK annotations and GNATprove to verify the resulting code, exploring a hybrid AI–formal methods approach. Benchmarking on 16 SPARK programs across five removal schemas shows Marmaragan correctly annotates $N$-Solutions and $Retries$ strategies in about $50.7\%$ of cases, illustrating both promise and limits of the method. The study demonstrates a viable foundation for integrating LLMs with formal verification to produce safer, verifiable software and outlines concrete directions for scaling and industrial adoption.

Abstract

Large language models (LLMs) have demonstrated remarkable code generation capabilities, but the correctness of the generated code cannot be inherently trusted. This paper explores the feasibility of using formal software verification, specifically the SPARK framework for Ada, to ensure the reliability of LLM-generated code. We present Marmaragan, a tool that leverages an LLM in order to generate SPARK annotations for existing programs, enabling formal verification of the code. The tool is benchmarked on a curated set of SPARK programs, with annotations selectively removed to test specific capabilities. The performance of Marmaragan with GPT-4o on the benchmark is promising, with correct annotations having been generated for 50.7% of the benchmark cases. The results establish a foundation for future work on combining the power of LLMs with the reliability of formal software verification.

Verifying LLM-Generated Code in the Context of Software Verification with Ada/SPARK

TL;DR

This work tackles the challenge of trusting LLM-generated code by pairing it with formal verification in the SPARK/Ada ecosystem. It presents Marmaragan, a prototype that uses an LLM to generate SPARK annotations and GNATprove to verify the resulting code, exploring a hybrid AI–formal methods approach. Benchmarking on 16 SPARK programs across five removal schemas shows Marmaragan correctly annotates -Solutions and strategies in about of cases, illustrating both promise and limits of the method. The study demonstrates a viable foundation for integrating LLMs with formal verification to produce safer, verifiable software and outlines concrete directions for scaling and industrial adoption.

Abstract

Large language models (LLMs) have demonstrated remarkable code generation capabilities, but the correctness of the generated code cannot be inherently trusted. This paper explores the feasibility of using formal software verification, specifically the SPARK framework for Ada, to ensure the reliability of LLM-generated code. We present Marmaragan, a tool that leverages an LLM in order to generate SPARK annotations for existing programs, enabling formal verification of the code. The tool is benchmarked on a curated set of SPARK programs, with annotations selectively removed to test specific capabilities. The performance of Marmaragan with GPT-4o on the benchmark is promising, with correct annotations having been generated for 50.7% of the benchmark cases. The results establish a foundation for future work on combining the power of LLMs with the reliability of formal software verification.

Paper Structure

This paper contains 35 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of the control flow in Marmaragan. Stage 1 initializes input parameters and generates the SPARK directory. Stage 2 begins the iteration over each of the files in the benchmark, formatting the prompt and invoking the LLM. Stage 3 is output processing, in Stage 4 GNATprove is run and mediums are extracted. Processes in Stage 5 only run if enabled: the prompt is reformatted with the previous incorrect solution and GNATprove medium(s) are also included.
  • Figure 2: Example of Gnatprove precompiled and formatted medium messages, which are added to the prompt
  • Figure 3: This chart details the distribution of programs from each of the three sources across the five benchmarks
  • Figure 4: Comparison of the number of programs solved per experiment. The bar is composed of the stacked number of solutions per benchmark.
  • Figure 5: Summary of the total amount of programs solved, per benchmark, across all five experiments. The chart is a 100% bar chart, thus the y-axis represents the amount of programs that were solved, as a percentage of the total number of programs in the benchmark.
  • ...and 2 more figures