Table of Contents
Fetching ...

Automated Repair of AI Code with Large Language Models and Formal Verification

Yiannis Charalambous, Edoardo Manino, Lucas C. Cordeiro

TL;DR

This work tackles safety guarantees for AI software by combining formal verification with automated repair. It expands NeuroCodeBench to about $81k$ C-based neural network implementations using Mull mutations, and uses ESBMC to label memory-safety vulnerabilities, treating it as an oracle for repair. Large language models are then employed to repair the AI code, with a comprehensive study of prompts, contexts, and iterative strategies (Forward History, LSO, etc.), revealing that carefully designed prompts can achieve meaningful repair rates, with iterative APR yielding around a $7\%$ improvement over non-iterative approaches. The findings highlight the potential and current limits of integrating formal verification and LLM-based program repair for AI code, and point to future work on broader LLMs and cost-conscious iterative strategies. The approach provides a practical pathway toward safer, verifiable AI software by tightly coupling vulnerability detection with automated, model-guided repair under formal guarantees.

Abstract

The next generation of AI systems requires strong safety guarantees. This report looks at the software implementation of neural networks and related memory safety properties, including NULL pointer deference, out-of-bound access, double-free, and memory leaks. Our goal is to detect these vulnerabilities, and automatically repair them with the help of large language models. To this end, we first expand the size of NeuroCodeBench, an existing dataset of neural network code, to about 81k programs via an automated process of program mutation. Then, we verify the memory safety of the mutated neural network implementations with ESBMC, a state-of-the-art software verifier. Whenever ESBMC spots a vulnerability, we invoke a large language model to repair the source code. For the latest task, we compare the performance of various state-of-the-art prompt engineering techniques, and an iterative approach that repeatedly calls the large language model.

Automated Repair of AI Code with Large Language Models and Formal Verification

TL;DR

This work tackles safety guarantees for AI software by combining formal verification with automated repair. It expands NeuroCodeBench to about C-based neural network implementations using Mull mutations, and uses ESBMC to label memory-safety vulnerabilities, treating it as an oracle for repair. Large language models are then employed to repair the AI code, with a comprehensive study of prompts, contexts, and iterative strategies (Forward History, LSO, etc.), revealing that carefully designed prompts can achieve meaningful repair rates, with iterative APR yielding around a improvement over non-iterative approaches. The findings highlight the potential and current limits of integrating formal verification and LLM-based program repair for AI code, and point to future work on broader LLMs and cost-conscious iterative strategies. The approach provides a practical pathway toward safer, verifiable AI software by tightly coupling vulnerability detection with automated, model-guided repair under formal guarantees.

Abstract

The next generation of AI systems requires strong safety guarantees. This report looks at the software implementation of neural networks and related memory safety properties, including NULL pointer deference, out-of-bound access, double-free, and memory leaks. Our goal is to detect these vulnerabilities, and automatically repair them with the help of large language models. To this end, we first expand the size of NeuroCodeBench, an existing dataset of neural network code, to about 81k programs via an automated process of program mutation. Then, we verify the memory safety of the mutated neural network implementations with ESBMC, a state-of-the-art software verifier. Whenever ESBMC spots a vulnerability, we invoke a large language model to repair the source code. For the latest task, we compare the performance of various state-of-the-art prompt engineering techniques, and an iterative approach that repeatedly calls the large language model.
Paper Structure (70 sections, 1 equation, 29 figures, 5 tables, 1 algorithm)

This paper contains 70 sections, 1 equation, 29 figures, 5 tables, 1 algorithm.

Figures (29)

  • Figure 1: Diagrammatic layout of a transformer model architecture vaswani2017attention.
  • Figure 2: A diagram visualization of how bounded model checking works. The visualization represents a state transition system $M$, a given property $\phi$, and a verification condition $\psi$lucassoftwaresecurity.
  • Figure 3: Overview of the NeuroCodeBench expansion pipeline. It consists of three key stages.
  • Figure 4: Performance of ESBMC on $25952$/$81129$ mutated programs, by neural network category.
  • Figure 5: Flowchart of a single attempt at repairing AI C Code with the LLM.
  • ...and 24 more figures