Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection

Niklas Risse, Marcel Böhme

TL;DR

The paper asks whether high vulnerability-detection scores reflect genuine capability or reliance on dataset-specific signals. It introduces two benchmarking algorithms (A1, which detects overfitting to unrelated features by applying semantics-preserving code transformations, and A2, which tests whether models can distinguish vulnerable functions from their patches) along with a new dataset, VulnPatchPairs, to stress-test generalization. Across six token-based ML4VD techniques and two datasets, the study finds persistent overfitting to unrelated features and poor generalization to patches; data augmentation improves standard metrics only for the specific augmentations applied during training. The proposed evaluation framework exposes these limits of current ML4VD methods and points to the need for robust out-of-distribution strategies before vulnerability detectors can be deployed safely in real software engineering settings.

Abstract

Recent results of machine learning for automatic vulnerability detection (ML4VD) have been very promising. Given only the source code of a function $f$, ML4VD techniques can decide if $f$ contains a security flaw with up to 70% accuracy. However, as evident in our own experiments, the same top-performing models are unable to distinguish between functions that contain a vulnerability and functions where the vulnerability is patched. So, how can we explain this contradiction and how can we improve the way we evaluate ML4VD techniques to get a better picture of their actual capabilities? In this paper, we identify overfitting to unrelated features and out-of-distribution generalization as two problems, which are not captured by the traditional approach of evaluating ML4VD techniques. As a remedy, we propose a novel benchmarking methodology to help researchers better evaluate the true capabilities and limits of ML4VD techniques. Specifically, we propose (i) to augment the training and validation dataset according to our cross-validation algorithm, where a semantic preserving transformation is applied during the augmentation of either the training set or the testing set, and (ii) to augment the testing set with code snippets where the vulnerabilities are patched. Using six ML4VD techniques and two datasets, we find (a) that state-of-the-art models severely overfit to unrelated features for predicting the vulnerabilities in the testing data, (b) that the performance gained by data augmentation does not generalize beyond the specific augmentations applied during training, and (c) that state-of-the-art ML4VD techniques are unable to distinguish vulnerable functions from their patches.
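The comparison behind proposal (i) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token-memorizing "model", the identifier-renaming transformation, the function names, and the four toy C functions are all hypothetical and were chosen so that the effect is visible. Whether the paper replaces or extends the training set during augmentation is not stated here, so the sketch assumes replacement.

```python
import re
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, int]  # (function source, label), 1 = vulnerable


def rename_identifier(code: str, old: str, new: str) -> str:
    """Toy semantics-preserving transformation: rename one identifier."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def augment(data: List[Sample], transform: Callable[[str], str]) -> List[Sample]:
    """Transform every function; labels are unchanged by construction."""
    return [(transform(code), label) for code, label in data]


def train_token_model(train: List[Sample]) -> Callable[[str], int]:
    """Hypothetical stand-in for an ML4VD model (assumption, not from the paper).

    It flags a function if it contains any token that appeared only in
    vulnerable training functions, i.e. it overfits to surface tokens.
    """
    vuln_tokens, safe_tokens = set(), set()
    for code, label in train:
        tokens = set(re.findall(r"\w+", code))
        (vuln_tokens if label == 1 else safe_tokens).update(tokens)
    signal = vuln_tokens - safe_tokens
    return lambda code: int(bool(signal & set(re.findall(r"\w+", code))))


def accuracy(predict: Callable[[str], int], data: List[Sample]) -> float:
    return sum(predict(code) == label for code, label in data) / len(data)


def evaluate_settings(train, test, t_k, t_j) -> Dict[str, float]:
    """Compare the standard setting with three augmentation settings."""
    return {
        # Standard evaluation: nothing is transformed.
        "baseline": accuracy(train_token_model(train), test),
        # Only the testing data is transformed.
        "test_only": accuracy(train_token_model(train), augment(test, t_k)),
        # Training and testing data use the same transformation.
        "same": accuracy(train_token_model(augment(train, t_k)), augment(test, t_k)),
        # Training and testing data use different transformations.
        "different": accuracy(train_token_model(augment(train, t_j)), augment(test, t_k)),
    }


# Toy data (hypothetical): the vulnerable functions index a buffer unchecked.
train = [
    ("int f(int *buf, int i) { return buf[i]; }", 1),
    ("int g(int *arr, int i) { return i < 4 ? arr[i] : 0; }", 0),
]
test = [
    ("int h(int *buf, int i) { return buf[i]; }", 1),
    ("int k(int *arr, int i) { return i < 4 ? arr[i] : 0; }", 0),
]

results = evaluate_settings(
    train,
    test,
    t_k=lambda code: rename_identifier(code, "buf", "data"),
    t_j=lambda code: rename_identifier(code, "buf", "ptr"),
)
print(results)
# {'baseline': 1.0, 'test_only': 0.5, 'same': 1.0, 'different': 0.5}
```

In this toy setup the accuracy drops when only the testing data is renamed, recovers when training uses the same renaming, and drops again when training uses a different renaming. That is the pattern findings (a) and (b) describe, although the numbers here come from a constructed example, not from the paper's experiments.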

Paper Structure

This paper contains 20 sections, 11 figures, 3 tables, and 2 algorithms.

Figures (11)

  • Figure 1: Example of a simple semantic preserving transformation. The change (orange) has no effect on the vulnerability label. Both code snippets contain a security vulnerability (integer overflow in line 4). The code was taken from the FFmpeg GitHub repository (URL: https://github.com/FFmpeg/FFmpeg/commit/92da2309) and is part of the CodeXGLUE/Devign dataset.
  • Figure 2: Visualization of Algorithm 1, which we created to detect overfitting of ML4VD techniques to unrelated features introduced by data augmentation. Colors indicate whether only the testing data is augmented (blue), or training and testing data are augmented using the same (orange) or different (green) augmentation methods.
  • Figure 3: Visualization of Algorithm 2, which we created to test whether ML4VD techniques are able to generalize to a modified setting that requires distinguishing between vulnerabilities and patches.
  • Figure 4: Visualization of the collection process for our new dataset VulnPatchPairs.
  • Figure 5: Effects of augmenting the testing data and the training data using the same semantic preserving transformations.
  • ...and 6 more figures
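
The vulnerability-versus-patch setting that Figure 3 refers to can be sketched as follows. This is a hedged illustration only: the `pair_metrics` helper, the "pairs fully correct" score, the keyword-based detector, and the example pair (including the bound of 16) are hypothetical and are not taken from the paper or from VulnPatchPairs.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (vulnerable function, patched function)


def pair_metrics(predict: Callable[[str], int], pairs: List[Pair]) -> Dict[str, float]:
    """Score a detector on vulnerability/patch pairs (1 = vulnerable).

    'accuracy' treats all 2 * n functions as independent test samples.
    'pairs_fully_correct' is an illustrative extra score (an assumption):
    the share of pairs where the vulnerable version is flagged and the
    patched version is not.
    """
    n = len(pairs)
    flagged_vuln = sum(predict(v) == 1 for v, _ in pairs)
    cleared_patch = sum(predict(p) == 0 for _, p in pairs)
    both = sum(predict(v) == 1 and predict(p) == 0 for v, p in pairs)
    return {
        "accuracy": (flagged_vuln + cleared_patch) / (2 * n),
        "pairs_fully_correct": both / n,
    }


def keyword_detector(code: str) -> int:
    """Hypothetical detector that keys on an API name, not on the bug."""
    return int("memcpy" in code)


# Hypothetical pair: the patch adds a bounds check before the copy.
pairs = [
    (
        "void f(char *d, char *s, int n) { memcpy(d, s, n); }",
        "void f(char *d, char *s, int n) { if (n <= 16) memcpy(d, s, n); }",
    ),
]

metrics = pair_metrics(keyword_detector, pairs)
print(metrics)
# {'accuracy': 0.5, 'pairs_fully_correct': 0.0}
```

A detector that relies on features shared by the vulnerable and patched versions flags both, so it scores no better than chance on balanced pairs even if it looks accurate on a dataset where such features happen to correlate with the label.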