Table of Contents
Fetching ...

How secure is AI-generated Code: A Large-Scale Comparison of Large Language Models

Norbert Tihanyi, Tamas Bisztray, Mohamed Amine Ferrag, Ridhi Jain, Lucas C. Cordeiro

TL;DR

This work assesses the security of AI-generated C code at scale by evaluating nine large language models using formal verification with ESBMC. It introduces FormAI-v2, a 331,000-sample corpus of compilable C programs labeled for vulnerabilities via unbounded model checking, enabling rigorous comparisons of model security and vulnerability patterns. The study finds that at least 62.07% of generated samples are vulnerable, with dereference failures and buffer overflows (notably from scanf-family inputs) being among the most prevalent issues, and no single model clearly outperforming others. The results highlight the practical need for risk assessment and validation frameworks when deploying LLM-generated code in production, and the FormAI-v2 dataset provides a valuable resource for ML benchmarking, vulnerability detection, and fuzzing research.

Abstract

This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs using a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE'23, featuring 112,000 C programs generated by GPT-3.5-turbo, with over 51.24% identified as vulnerable. We extended that research with a large-scale study involving 9 state-of-the-art models such as OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180 billion-parameter Falcon, Meta's 13 billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, which comprises 331 000 compilable C programs generated by these LLMs. Each program in the dataset is labeled based on the vulnerabilities detected in its source code through formal verification, using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for the specific vulnerability and reduces false negatives by thoroughly completing the verification process. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor, as they all show similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.

How secure is AI-generated Code: A Large-Scale Comparison of Large Language Models

TL;DR

This work assesses the security of AI-generated C code at scale by evaluating nine large language models using formal verification with ESBMC. It introduces FormAI-v2, a 331,000-sample corpus of compilable C programs labeled for vulnerabilities via unbounded model checking, enabling rigorous comparisons of model security and vulnerability patterns. The study finds that at least 62.07% of generated samples are vulnerable, with dereference failures and buffer overflows (notably from scanf-family inputs) being among the most prevalent issues, and no single model clearly outperforming others. The results highlight the practical need for risk assessment and validation frameworks when deploying LLM-generated code in production, and the FormAI-v2 dataset provides a valuable resource for ML benchmarking, vulnerability detection, and fuzzing research.

Abstract

This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs using a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE'23, featuring 112,000 C programs generated by GPT-3.5-turbo, with over 51.24% identified as vulnerable. We extended that research with a large-scale study involving 9 state-of-the-art models such as OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180 billion-parameter Falcon, Meta's 13 billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, which comprises 331 000 compilable C programs generated by these LLMs. Each program in the dataset is labeled based on the vulnerabilities detected in its source code through formal verification, using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for the specific vulnerability and reduces false negatives by thoroughly completing the verification process. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor, as they all show similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.
Paper Structure (28 sections, 1 equation, 9 figures, 14 tables)

This paper contains 28 sections, 1 equation, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Motivation example: GPT-4 produced code with security vulnerabilities, demonstrated through formal verification results.
  • Figure 2: GPT-4 generated code snippet response after requesting a secure version of the code in Figure 1.
  • Figure 3: Zero-shot prompt requesting a fix for integer overflow (failing to do so).
  • Figure 4: FormAI-v2 dataset generation Framework using different LLMs.
  • Figure 5: Dynamic Code Generation Prompt.
  • ...and 4 more figures