Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

Avishree Khare, Saikat Dutta, Ziyang Li, Alaia Solko-Breslin, Rajeev Alur, Mayur Naik

TL;DR

This work provides the most comprehensive empirical evaluation to date of pre-trained large language models for security vulnerability detection across multiple languages and real-world datasets. It shows that LLMs achieve modest overall accuracy but can outperform static analyzers on certain vulnerability classes, especially with advanced prompting that simulates dataflow reasoning. The study reveals that model scale alone does not guarantee better performance on real-world code and highlights the value of combining LLMs with symbolic/static analysis to improve coverage and explainability. The findings guide practical deployment of LLM-informed vulnerability detection and motivate neuro-symbolic approaches that leverage LLM reasoning with traditional analysis tools.

Abstract

While automated vulnerability detection techniques have made promising progress in detecting security vulnerabilities, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore whether LLMs can be used to detect vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a larger number of datasets, languages, and LLMs, and by qualitatively evaluating performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples from five diverse security datasets. These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes. Overall, LLMs across all scales and families show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and an F1 score of 0.71 across datasets. They are significantly better at detecting vulnerabilities that require only intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, they report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL. We find that advanced prompting strategies that involve step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.
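
The abstract reports both accuracy and F1 score on balanced datasets. As a reading aid, the sketch below shows how these two metrics are conventionally computed for binary vulnerable/not-vulnerable predictions. The function name `accuracy_and_f1` and the toy labels are illustrative assumptions, not taken from the paper, and the example says nothing about why the paper's own numbers come out as they do.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(labels: Sequence[bool], preds: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary predictions.

    True means "vulnerable". Hypothetical helper for illustration only.
    """
    if len(labels) != len(preds) or len(labels) == 0:
        raise ValueError("labels and preds must be non-empty and the same length")

    # Confusion-matrix counts, with "vulnerable" as the positive class.
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)

    accuracy = (tp + tn) / len(labels)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


# Toy balanced set (assumed data): 4 vulnerable samples, 4 safe samples.
labels = [True] * 4 + [False] * 4
# A detector that flags every vulnerable sample but also two safe ones.
preds = [True, True, True, True, True, True, False, False]

accuracy, f1 = accuracy_and_f1(labels, preds)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

On this toy input the detector has two false positives and no false negatives, so F1 comes out higher than accuracy. This is one way the two metrics can diverge on a balanced dataset.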

Paper Structure

This paper contains 33 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Effectiveness of LLMs in Predicting Security Vulnerabilities in Java and C/C++ (highest accuracy and F1 scores per model per dataset, across all prompting strategies).
  • Figure 2: Performance of different prompting strategies
  • Figure 3: Accuracy Across CWEs on real-world datasets.
  • Figure 4: Accuracy Across CWEs on synthetic datasets.
  • Figure: CodeLlama-7B correctly predicts this code is not vulnerable to Integer Overflow but CodeLlama-13B does not.
  • ...and 3 more figures