Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection
Niklas Risse, Marcel Böhme
TL;DR
The paper asks whether high vulnerability-detection scores reflect genuine capability or reliance on dataset-specific signals. It introduces two benchmarking algorithms: A1, which detects overfitting to features changed by semantics-preserving code transformations, and A2, which tests whether a model can tell a vulnerable function from its patched version. It also introduces a new dataset, VulnPatchPairs, to stress-test that generalization. Across six token-based ML4VD techniques and two datasets, the study finds persistent overfitting to unrelated features and poor generalization to patches. Data augmentation improves standard metrics only when the test data is augmented with the same transformations used during training. The proposed evaluation framework exposes fundamental limits of current ML4VD methods and points to the need for robust out-of-distribution strategies before vulnerability detectors can be deployed safely in real software engineering settings.
Abstract
Recent results of machine learning for automatic vulnerability detection (ML4VD) have been very promising. Given only the source code of a function $f$, ML4VD techniques can decide if $f$ contains a security flaw with up to 70% accuracy. However, as evident in our own experiments, the same top-performing models are unable to distinguish between functions that contain a vulnerability and functions where the vulnerability is patched. So, how can we explain this contradiction, and how can we improve the way we evaluate ML4VD techniques to get a better picture of their actual capabilities? In this paper, we identify overfitting to unrelated features and out-of-distribution generalization as two problems that are not captured by the traditional approach of evaluating ML4VD techniques. As a remedy, we propose a novel benchmarking methodology to help researchers better evaluate the true capabilities and limits of ML4VD techniques. Specifically, we propose (i) to augment the training and validation dataset according to our cross-validation algorithm, where a semantics-preserving transformation is applied during the augmentation of either the training set or the testing set, and (ii) to augment the testing set with code snippets where the vulnerabilities are patched. Using six ML4VD techniques and two datasets, we find (a) that state-of-the-art models severely overfit to unrelated features for predicting the vulnerabilities in the testing data, (b) that the performance gained by data augmentation does not generalize beyond the specific augmentations applied during training, and (c) that state-of-the-art ML4VD techniques are unable to distinguish vulnerable functions from their patches.
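The two evaluation ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names (`cross_transformation_eval`, `pair_accuracy`), the toy token-based "trainer", the two string transformations, and the four code snippets are all hypothetical and chosen only to make the mechanics visible. The sketch trains on data augmented with one transformation, compares accuracy on test data augmented with the same transformation against test data augmented with a different one, and then checks whether a model separates a vulnerable snippet from its patch.

```python
import re
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]            # (source code, label); 1 = vulnerable
Transform = Callable[[str], str]    # assumed to preserve program semantics
Model = Callable[[str], int]
Trainer = Callable[[Sequence[Sample]], Model]


def tokens(code: str) -> set:
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"\w+", code))


def toy_trainer(data: Sequence[Sample]) -> Model:
    """Hypothetical stand-in for an ML4VD model: it flags any function that
    contains a token seen only in vulnerable training samples."""
    vuln, safe = set(), set()
    for code, label in data:
        (vuln if label == 1 else safe).update(tokens(code))
    vuln_only = vuln - safe
    return lambda code: int(bool(tokens(code) & vuln_only))


def augment(data: Sequence[Sample], t: Transform) -> List[Sample]:
    """Apply one transformation to every sample, keeping the labels."""
    return [(t(code), label) for code, label in data]


def accuracy(model: Model, data: Sequence[Sample]) -> float:
    return sum(model(code) == label for code, label in data) / len(data)


def cross_transformation_eval(train, test, transforms, trainer):
    """For each transformation t_k: train on t_k-augmented data, then report
    (accuracy on t_k-augmented test data,
     mean accuracy on test data augmented with the other transformations)."""
    results = []
    for k, t_k in enumerate(transforms):
        model = trainer(augment(train, t_k))
        same = accuracy(model, augment(test, t_k))
        others = [accuracy(model, augment(test, t_j))
                  for j, t_j in enumerate(transforms) if j != k]
        results.append((same, sum(others) / len(others)))
    return results


def pair_accuracy(model: Model, pairs) -> float:
    """Fraction of (vulnerable, patched) pairs where the vulnerable version
    is flagged and the patched version is not."""
    return sum(model(v) == 1 and model(p) == 0 for v, p in pairs) / len(pairs)


# Two toy semantics-preserving transformations (illustrative only).
rename_var = lambda code: code.replace("buf", "tmp")   # consistent rename
add_comment = lambda code: "/* aug */ " + code          # comment insertion

# Invented toy data, not taken from any real dataset.
TRAIN = [("memcpy(buf, src, len);", 1), ("memcpy(dst, src, len);", 0)]
TEST = [("read(fd, buf, n);", 1), ("read(fd, dst, n);", 0)]
PAIRS = [("read(fd, buf, n);", "if (n <= cap) read(fd, buf, n);")]

gaps = cross_transformation_eval(TRAIN, TEST, [rename_var, add_comment], toy_trainer)
pair_score = pair_accuracy(toy_trainer(TRAIN), PAIRS)
```

On this invented data the toy model latches onto a variable name rather than the missing bounds check, so it scores perfectly when the test set shares the training transformation and drops to chance otherwise, and it cannot separate the vulnerable snippet from its patch. The numbers say nothing about real ML4VD models; they only show what kind of gap such an evaluation is designed to surface.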
