Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection

Xin Zhou; Duc-Manh Tran; Thanh Le-Cong; Ting Zhang; Ivana Clairine Irsan; Joshua Sumarlin; Bach Le; David Lo

TL;DR

This study introduces repo-level vulnerability detection and benchmarks 15 SAST tools against 12 open-source LLMs across Java, C, and Python. It formulates a mapping $M:\mathcal{X}\mapsto\mathcal{Y}$ that assigns per-function vulnerability labels within entire repositories, and it evaluates several LLM adaptation strategies (prompt-based methods and LoRA-based fine-tuning) alongside SAST analyses. SAST tools deliver low detection rates with low false positives, while LLMs achieve high detection rates but with many false positives; ensemble approaches, across SAST tools or across LLMs, can mitigate these weaknesses, and the best choices are language-dependent. The results offer practical guidance for deploying combined SAST/LLM strategies and point to future research on improving precision, contextual prompting, and scalable repo-wide vulnerability detection.

Abstract

Software vulnerabilities pose significant security challenges and potential risks to society, necessitating extensive efforts in automated vulnerability detection. There are two popular lines of work on automated vulnerability detection. On one hand, Static Application Security Testing (SAST) is widely used to scan source code for security vulnerabilities, especially in industry. On the other hand, deep learning (DL)-based methods, especially since the introduction of large language models (LLMs), have demonstrated their potential in software vulnerability detection. However, there is no comparative study of SAST tools and LLMs that aims to determine their effectiveness in vulnerability detection, understand the pros and cons of both, and explore the potential combination of these two families of approaches. In this paper, we compared 15 diverse SAST tools with 12 popular or state-of-the-art open-source LLMs in detecting software vulnerabilities from repositories of three popular programming languages: Java, C, and Python. The experimental results showed that SAST tools obtain low vulnerability detection rates with relatively low false positives, while LLMs can detect up to 90\% to 100\% of vulnerabilities but suffer from high false positives. By further ensembling the SAST tools and LLMs, the drawbacks of both SAST tools and LLMs can be mitigated to some extent. Our analysis sheds light on both the current progress and future directions for software vulnerability detection.
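
To make the idea of per-function labels and ensembling concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's method: the vote-threshold rule, the names `detect_repo` and `ensemble`, and the two keyword-based toy detectors are all hypothetical stand-ins for real SAST tools and LLMs.

```python
from typing import Callable, Dict, List

# Assumption: a detector maps one function's source code to a binary label
# (1 = flagged as vulnerable, 0 = not flagged).
Detector = Callable[[str], int]


def detect_repo(functions: Dict[str, str], detector: Detector) -> Dict[str, int]:
    """Label every function in a repository with a single detector."""
    return {name: detector(code) for name, code in functions.items()}


def ensemble(predictions: List[Dict[str, int]], min_votes: int) -> Dict[str, int]:
    """Flag a function when at least `min_votes` detectors flag it.

    This voting rule is an illustrative assumption, not the paper's scheme.
    """
    names = predictions[0].keys()
    return {n: int(sum(p[n] for p in predictions) >= min_votes) for n in names}


# Toy repository: function name -> function body (hypothetical data).
repo = {
    "copy_input": "strcpy(buf, user_input);",
    "largest": "return max(a, b);",
    "run": "system(cmd);",
}

# Toy stand-in for a conservative tool: few detections, few false alarms.
strict = lambda code: int("strcpy" in code)
# Toy stand-in for an eager detector: flags anything containing a call.
eager = lambda code: int("(" in code)

strict_preds = detect_repo(repo, strict)
eager_preds = detect_repo(repo, eager)

union = ensemble([strict_preds, eager_preds], min_votes=1)
agreement = ensemble([strict_preds, eager_preds], min_votes=2)
print(union)
print(agreement)
```

Lowering `min_votes` favors detection rate at the cost of more flagged functions, while raising it keeps only the functions on which the detectors agree.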

Paper Structure (35 sections, 3 equations, 9 figures, 4 tables)

Introduction
Study Design
Repo-level Vulnerability Detection
Repo-level Vulnerability Detection Task Formation
Comparison with Function-level Vulnerability Detection
Applying LLMs for Repo-level Vulnerability Detection
Vulnerability Datasets
SAST Tools Selection
Studied LLMs
LLM Adaptation Techniques
Prompt-based Methods
Fine-tuning-based Methods
Experimental Setup
Vulnerability Detection Scenarios
Evaluation Metrics
...and 20 more sections

Figures (9)

Figure 1: Overview of Our Study
Figure 2: Repo-level and Function-level Vul. Detection
Figure 3: Java Benchmark
Figure 4: C Benchmark
Figure 5: Python Benchmark
...and 4 more figures
