
Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study

Karl Tamberg, Hayretdin Bahsi

TL;DR

This study benchmarks off-the-shelf large language models (LLMs) for software vulnerability detection, treating LLMs as static code analysers and systematically comparing prompting strategies across multiple models. Using the Juliet Java 1.3 dataset, the authors evaluate several prompting techniques (including dataflow analysis, chain-of-thought, self-refinement, and recursion-based re-evaluation) and compare the results against CodeQL and SpotBugs. They show that, with model- and prompt-specific configurations, LLMs can outperform traditional static analysers in recall and $F1$, at the price of higher cost and latency, which highlights a trade-off between coverage and practicality. The work gives practitioners guidance on selecting prompts and models for vulnerability detection and discusses limitations, CWE-mapping challenges, and avenues for future research and tool synergy.

Abstract

Despite various approaches being employed to detect vulnerabilities, the number of reported vulnerabilities shows an upward trend over the years. This suggests the problems are not caught before the code is released, which could be caused by many factors, like lack of awareness, limited efficacy of the existing vulnerability detection tools or the tools not being user-friendly. To help combat some issues with traditional vulnerability detection tools, we propose using large language models (LLMs) to assist in finding vulnerabilities in source code. LLMs have shown a remarkable ability to understand and generate code, underlining their potential in code-related tasks. The aim is to test multiple state-of-the-art LLMs and identify the best prompting strategies, allowing extraction of the best value from the LLMs. We provide an overview of the strengths and weaknesses of the LLM-based approach and compare the results to those of traditional static analysis tools. We find that LLMs can pinpoint many more issues than traditional static analysis tools, outperforming traditional tools in terms of recall and F1 scores. The results should benefit software developers and security analysts responsible for ensuring that the code is free of vulnerabilities.
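
The comparison above is stated in terms of recall and F1. As a reminder of how these metrics relate, the following is a minimal sketch using the standard definitions; the function name and the counts in the usage lines are hypothetical illustrations and are not taken from the paper's results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from detection counts.

    tp: vulnerable samples correctly flagged
    fp: non-vulnerable samples incorrectly flagged
    fn: vulnerable samples that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only (not from the study):
# a detector that flags many samples tends to gain recall but lose precision.
eager = precision_recall_f1(tp=90, fp=60, fn=10)
cautious = precision_recall_f1(tp=40, fp=5, fn=60)
```

With these made-up counts, the eager detector has the higher recall and the cautious one the higher precision; F1 balances the two, which is why both recall and F1 are reported when detectors are compared.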

Paper Structure (26 sections, 1 equation, 6 tables)