Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection

Niklas Risse; Jing Liu; Marcel Böhme

TL;DR

This paper interrogates the prevailing ML4VD practice of treating vulnerability detection as function-level binary classification, revealing that most vulnerabilities are context-dependent and cannot be decided from a function alone. Through a literature survey of 81 papers (2020–2024) and an empirical study on BigVul, Devign, and DiverseVul, the authors show substantial label noise and that high performance can be achieved via spurious, context-free features such as word counts. They demonstrate that all true vulnerabilities in their sample required external context, casting doubt on function-level benchmarking as a valid proxy for real-world vulnerability detection. The work advocates context-aware benchmarking, alternative problem formulations (e.g., abstention and inter-procedural analysis), and closer integration with static analysis to ensure that ML4VD research measures genuine vulnerability-detection capabilities with meaningful practical impact.

Abstract

According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem: Given a function, does it contain a security flaw? From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called. In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function "vulnerable" if it was involved in a patch of an actual security flaw and confirmed to cause the program's vulnerability. It is "non-vulnerable" otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed. But why do ML4VD techniques achieve high scores even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high scores can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high scores without actually detecting any security vulnerabilities. We conclude that the prevailing problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and program analysis research.
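To make the word-count observation concrete, the following is a minimal sketch of a classifier that sees only token frequencies, with no ordering, data flow, or calling context. It is not the authors' experimental setup: the four-function toy corpus, the labels, and the names `word_counts`, `train`, and `predict` are all hypothetical, and the model is a plain logistic regression written with the Python standard library.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (NOT taken from BigVul, Devign, or DiverseVul).
# Label 1 = "vulnerable", 0 = "non-vulnerable"; the labels happen to
# correlate with the presence of the token `memcpy`.
TOY_SAMPLES = [
    "void f(char *d, char *s, int n) { memcpy(d, s, n); }",
    "int g(int *buf, int len) { memcpy(buf, src, len); return 0; }",
    "int h(int a, int b) { return a + b; }",
    "void k(int x) { log(x); }",
]
TOY_LABELS = [1, 1, 0, 0]


def word_counts(code: str) -> Counter:
    """Bag-of-words features: identifier/keyword frequencies only."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code))


def _sigmoid(z: float) -> float:
    # Clamp to avoid overflow in math.exp for extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def _score(weights: dict, bias: float, feats: Counter) -> float:
    return bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())


def train(samples, labels, epochs: int = 200, lr: float = 0.1):
    """Fit logistic regression on word counts with plain SGD."""
    weights, bias = {}, 0.0
    feats = [word_counts(s) for s in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            grad = _sigmoid(_score(weights, bias, f)) - y
            bias -= lr * grad
            for tok, cnt in f.items():
                weights[tok] = weights.get(tok, 0.0) - lr * grad * cnt
    return weights, bias


def predict(weights: dict, bias: float, code: str) -> int:
    """Return 1 ("vulnerable") if the word-count score is positive."""
    return 1 if _score(weights, bias, word_counts(code)) > 0 else 0


if __name__ == "__main__":
    w, b = train(TOY_SAMPLES, TOY_LABELS)
    for sample in TOY_SAMPLES:
        print(predict(w, b, sample), sample)
```

The model fits the toy labels although it has no way to reason about buffer sizes or callers. That is the sense in which a high benchmark score can reflect token statistics rather than vulnerability detection.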

Paper Structure (15 sections, 11 figures, 3 tables)

Introduction
Background
Literature Survey
Methodology
Results
Empirical Study Design
Methodology RQ.1
Methodology RQ.2
Results
Threats to the Validity
Internal validity
External validity
Construct validity
Discussion and Future Work
Data Availability

Figures (11)

Figure 1: Context-dependent vulnerability (CVE-2021-29599) in DiverseVul dataset. If the function is called with num_splits=0, it crashes with a division-by-zero in Line 7.
Figure 2: Literature survey results for the 81 ML4VD papers we identified in the top Software Engineering (SE) and Security conferences and journals. Panel (a) shows how the papers define the problem of ML4VD. Note that a paper may use multiple granularities, which explains why the numbers in panel (a) do not add up to 100%. Panel (b) shows how many papers were published each year since 2020.
Figure 3: Datasets that are used by ML4VD papers published at the top Software Engineering and Security conferences and journals over the last five years. Datasets that were used only once are displayed as "Other".
Figure 4: Popularity measured by citations of the datasets we selected for our empirical study.
Figure 5: Empirical Study Design: Our manual labeling process to determine what proportion of security vulnerabilities in popular datasets can be detected without considering additional context beyond the function-level.
...and 6 more figures
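
The context dependence described in the Figure 1 caption can be sketched as follows. This is not the code from CVE-2021-29599 or from DiverseVul; `split_evenly`, `validated_caller`, and `unvalidated_caller` are hypothetical names, and Python's `ZeroDivisionError` stands in for the division-by-zero crash. The point is only that the same function body is safe under one caller and crashes under another.

```python
def split_evenly(total_size: int, num_splits: int) -> int:
    """Hypothetical stand-in for the function in Figure 1.

    Looking at this function alone, one cannot tell whether
    num_splits == 0 can ever reach the division.
    """
    return total_size // num_splits  # fails when num_splits == 0


def validated_caller(total_size: int, num_splits: int) -> int:
    """Calling context that rules out the crash before the call."""
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    return split_evenly(total_size, num_splits)


def unvalidated_caller(total_size: int, untrusted_num_splits: int) -> int:
    """Calling context that forwards untrusted input unchanged."""
    return split_evenly(total_size, untrusted_num_splits)


if __name__ == "__main__":
    print(validated_caller(8, 2))
    try:
        unvalidated_caller(8, 0)
    except ZeroDivisionError as exc:
        print("crash reachable through the unvalidated caller:", exc)
```

A function-level label for `split_evenly` would have to be the same in both programs, even though only the program containing the unvalidated caller is exploitable.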
