Automated Repair of AI Code with Large Language Models and Formal Verification

Yiannis Charalambous; Edoardo Manino; Lucas C. Cordeiro

Automated Repair of AI Code with Large Language Models and Formal Verification

Yiannis Charalambous, Edoardo Manino, Lucas C. Cordeiro

TL;DR

This work tackles safety guarantees for AI software by combining formal verification with automated repair. It expands NeuroCodeBench to about $81k$ C-based neural network implementations using Mull mutations, and uses ESBMC to label memory-safety vulnerabilities, treating it as an oracle for repair. Large language models are then employed to repair the AI code, with a comprehensive study of prompts, contexts, and iterative strategies (Forward History, LSO, etc.), revealing that carefully designed prompts can achieve meaningful repair rates, with iterative APR yielding around a $7\%$ improvement over non-iterative approaches. The findings highlight the potential and current limits of integrating formal verification and LLM-based program repair for AI code, and point to future work on broader LLMs and cost-conscious iterative strategies. The approach provides a practical pathway toward safer, verifiable AI software by tightly coupling vulnerability detection with automated, model-guided repair under formal guarantees.

Abstract

The next generation of AI systems requires strong safety guarantees. This report looks at the software implementation of neural networks and related memory safety properties, including NULL pointer deference, out-of-bound access, double-free, and memory leaks. Our goal is to detect these vulnerabilities, and automatically repair them with the help of large language models. To this end, we first expand the size of NeuroCodeBench, an existing dataset of neural network code, to about 81k programs via an automated process of program mutation. Then, we verify the memory safety of the mutated neural network implementations with ESBMC, a state-of-the-art software verifier. Whenever ESBMC spots a vulnerability, we invoke a large language model to repair the source code. For the latest task, we compare the performance of various state-of-the-art prompt engineering techniques, and an iterative approach that repeatedly calls the large language model.

Automated Repair of AI Code with Large Language Models and Formal Verification

TL;DR

This work tackles safety guarantees for AI software by combining formal verification with automated repair. It expands NeuroCodeBench to about

C-based neural network implementations using Mull mutations, and uses ESBMC to label memory-safety vulnerabilities, treating it as an oracle for repair. Large language models are then employed to repair the AI code, with a comprehensive study of prompts, contexts, and iterative strategies (Forward History, LSO, etc.), revealing that carefully designed prompts can achieve meaningful repair rates, with iterative APR yielding around a

improvement over non-iterative approaches. The findings highlight the potential and current limits of integrating formal verification and LLM-based program repair for AI code, and point to future work on broader LLMs and cost-conscious iterative strategies. The approach provides a practical pathway toward safer, verifiable AI software by tightly coupling vulnerability detection with automated, model-guided repair under formal guarantees.

Abstract

Paper Structure (70 sections, 1 equation, 29 figures, 5 tables, 1 algorithm)

This paper contains 70 sections, 1 equation, 29 figures, 5 tables, 1 algorithm.

Introduction
Background
NeuroCodeBench
Hopfield Networks.
Polynomial Approximation Networks.
VNN-COMP Networks.
Safety Properties
Reachability.
Memory safety.
Generative Language Models
Software Verification
Creating An AI Code Dataset
Program Mutation
Data Augmentation
Building the Base Samples.
...and 55 more sections

Figures (29)

Figure 1: Diagrammatic layout of a transformer model architecture vaswani2017attention.
Figure 2: A diagram visualization of how bounded model checking works. The visualization represents a state transition system $M$, a given property $\phi$, and a verification condition $\psi$lucassoftwaresecurity.
Figure 3: Overview of the NeuroCodeBench expansion pipeline. It consists of three key stages.
Figure 4: Performance of ESBMC on $25952$/$81129$ mutated programs, by neural network category.
Figure 5: Flowchart of a single attempt at repairing AI C Code with the LLM.
...and 24 more figures

Automated Repair of AI Code with Large Language Models and Formal Verification

TL;DR

Abstract

Automated Repair of AI Code with Large Language Models and Formal Verification

Authors

TL;DR

Abstract

Table of Contents

Figures (29)