A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models

Sabrina Kaniewski, Fabian Schmidt, Markus Enzweiler, Michael Menth, Tobias Heer

TL;DR

This SLR addresses the fragmentation in LLM-based software vulnerability detection by organizing 263 studies into a fine-grained taxonomy covering task formulation, input representations, system architecture, adaptation techniques, and datasets. It finds that binary classification dominates, while approaches such as parameter-efficient fine-tuning (PEFT), retrieval-augmented generation (RAG), and multi-task learning are only beginning to emerge. The review critically analyzes vulnerability datasets, their Common Weakness Enumeration (CWE) coverage, and their long-tail distribution, and identifies data quality, comparability, and up-to-date knowledge as major barriers. It offers actionable guidance on structure-aware inputs, standardized evaluation, and integration into real development workflows, supported by a living replication package that fosters reproducibility and ongoing updates.

Abstract

The increasing adoption of Large Language Models (LLMs) in software engineering has sparked interest in their use for software vulnerability detection. However, the rapid development of this field has resulted in a fragmented research landscape, with diverse studies that are difficult to compare due to differences in, e.g., system designs and dataset usage. This fragmentation makes it hard to obtain a clear overview of the state of the art or to compare and categorize studies meaningfully. In this work, we present a comprehensive systematic literature review (SLR) of LLM-based software vulnerability detection. We analyze 263 studies published between January 2020 and November 2025, categorizing them by task formulation, input representation, system architecture, and techniques. Further, we analyze the datasets used, including their characteristics, vulnerability coverage, and diversity. We present a fine-grained taxonomy of vulnerability detection approaches, identify key limitations, and outline actionable future research opportunities. By providing a structured overview of the field, this review improves transparency and serves as a practical guide for researchers and practitioners aiming to conduct more comparable and reproducible research. We publicly release all artifacts and maintain a living repository of LLM-based software vulnerability detection studies at https://github.com/hs-esslingen-it-security/Awesome-LLM4SVD.

Paper Structure

This paper contains 59 sections, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: Literature search and study selection pipeline. We use the constructed search string (cf. the section on the search string) for the automated search in the selected databases. Further, we continuously add to the selected studies via search alerts and scholar.inbox notifications. Crawled studies pass through the pipeline of inclusion and exclusion criteria as well as snowballing. Study statistics are as of December 13, 2025.
  • Figure 2: Publication characteristics of the included studies: (a) distribution of studies per year, (b) relation of peer-reviewed vs. preprint (arXiv) studies, and (c) distribution of conference and journal papers across venue tiers, using CORE for conference rankings and SJR for journal rankings.
  • Figure 3: Taxonomy of LLM-based vulnerability detection studies with numbering in parentheses. We omit individual numbering in lower-level nodes for readability. A study may be associated with multiple values (i.e., white boxes) per category.
  • Figure 4: Relationship between classification and generation task formulations. Only a minority of studies combine the classification task with a generative task. Numbers in braces indicate the absolute count of unique studies applying the task formulation; a single study may perform multiple tasks.
  • Figure 5: Relationship between model scale and adaptation techniques across surveyed studies. Tiny and small models are predominantly adapted via full-parameter fine-tuning, whereas medium and large models increasingly utilize parameter-efficient fine-tuning and prompt engineering. Numbers in braces indicate the absolute count of unique studies applying the respective adaptation techniques; a single study may apply multiple adaptation techniques.
  • ...and 8 more figures