SoK: On Closing the Applicability Gap in Automated Vulnerability Detection

Ezzeldin Shereen, Dan Ristea, Sanyam Vyas, Shae McFadden, Madeleine Dwyer, Chris Hicks, Vasilios Mavroudis

TL;DR

This SoK systematically analyzes the Automated Vulnerability Detection (AVD) literature (79 articles, 17 empirical studies) across five core components to assess its real-world applicability. It finds a narrow emphasis on function-level binary classification in C/C++, with insufficient multilingual support and heterogeneous evaluation practices that threaten reproducibility. The authors identify key gaps in the diversity of problem formulations and detection granularities, in dataset quality, and in open science, and propose directions such as cross-domain transfer learning, inclusion of binaries, and standardized benchmarks to bridge the applicability gap. The work highlights LLM- and GNN-based approaches as leading trends while cautioning about data leakage and stressing the need for realistic, time-aware evaluations to drive practical adoption in software security.

Abstract

The frequent discovery of security vulnerabilities in both open-source and proprietary software underscores the urgent need for earlier detection during the development lifecycle. Initiatives such as DARPA's Artificial Intelligence Cyber Challenge (AIxCC) aim to accelerate Automated Vulnerability Detection (AVD), which seeks to address this challenge by autonomously analyzing source code to identify vulnerabilities. This paper addresses two primary research questions: (RQ1) How is current AVD research distributed across its core components? (RQ2) What key areas should future research target to bridge the gap in the practical applicability of AVD throughout software development? To answer these questions, we conduct a systematization over 79 AVD articles and 17 empirical studies, analyzing them across five core components: task formulation and granularity, input programming languages and representations, detection approaches and key solutions, evaluation metrics and datasets, and reported performance. Our systematization reveals that the narrow focus of AVD research, mainly on specific tasks and programming languages, limits its practical impact and overlooks broader areas crucial for effective, real-world vulnerability detection. We identify significant challenges, including the need for diversified problem formulations, varied detection granularities, broader language support, better dataset quality, enhanced reproducibility, and increased practical impact. Based on these findings, we identify research directions that will enhance the effectiveness and applicability of AVD solutions in software security.

Paper Structure

This paper contains 42 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Vulnerability analysis pipeline. The flow between detection, exploitation, and repair illustrates the stages of vulnerability management, where detection identifies vulnerabilities, exploitation assesses their impact, and repair mitigates the risks through patching.
  • Figure 2: Summary of the article collection and screening process.
  • Figure 3: Distribution of the publication year of included papers after the full-text screening phase. Data for 2024 covers Jan-Aug.
  • Figure 4: The five main components of AVD literature systematized in the paper. Each component is studied in one sub-section of the systematization section (sec:sok).
  • Figure 5: Comparison of commonly used programming languages according to (1) their proportion in current AVD research, (2) the proportion of the 2023 MITRE top 25 most dangerous software weaknesses to which they are vulnerable, (3) their proportional MITRE danger score, (4) their 2024 IEEE Trending rank (as a proportion of the most-used language, Python), and (5) their proportion of GitHub pushes in Q1 2024.
  • ...and 3 more figures