A Study on Mixup-Inspired Augmentation Methods for Software Vulnerability Detection

Seyed Shayan Daneshvar, Da Tan, Shaowei Wang, Carson Leung

TL;DR

The paper addresses data scarcity and class imbalance in deep learning-based vulnerability detection. It systematically evaluates five representation-level augmentation methods (Linear Interpolation, Stochastic Perturbation, Linear Extrapolation, Binary Interpolation, Gaussian Scaling), plus a conditioned variant of them, on a leading token-based model (LineVul) using the BigVul dataset. Augmentation improves the F1-score by up to $9.67\%$ but does not outperform Random Oversampling (ROS), which improves it by up to $10.82\%$; the source-code-level generator VGX underperforms the embedding-based augmentation methods. The study therefore recommends ROS for balancing real-world vulnerability data, while showing that conditioned augmentation can offer targeted improvements at the cost of some added noise. It suggests future work on more intrusive, graph-aware augmentation approaches to further improve detection performance and generalization.

Abstract

Various deep learning (DL) methods have recently been utilized to detect software vulnerabilities. Real-world software vulnerability datasets are rare and hard to acquire, as there is no simple metric for classifying vulnerabilities. Such datasets are heavily imbalanced, and none of the current datasets are considered huge for DL models. To tackle these problems, a recent work has tried to augment the dataset using the source code and generate realistic single-statement vulnerabilities, which is not quite practical and requires manual checking of the generated vulnerabilities. In this paper, we aim to explore the augmentation of vulnerabilities at the representation level to help current models learn better, which, to the best of our knowledge, has never been done before. We implement and evaluate five augmentation techniques that augment the embedding of the data and have recently been used for code search, which is a completely different software engineering task. We also introduce a conditioned version of those augmentation methods, which ensures the augmentation does not change the vulnerable section of the vector representation. We show that such augmentation methods can be helpful and increase the F1-score by up to 9.67%, yet they cannot beat Random Oversampling when balancing datasets, which increases the F1-score by 10.82%.
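
As a point of reference for the Random Oversampling baseline mentioned in the abstract, the following is a minimal sketch of the general technique: minority-class samples are duplicated at random until every class has as many samples as the largest one. The function name `random_oversample`, the toy sample identifiers, and the fixed seed are illustrative assumptions, not the authors' implementation or their experimental setup.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all
    classes match the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw duplicates with replacement from the existing samples.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: five non-vulnerable functions (label 0), one vulnerable (label 1).
samples = ["f1", "f2", "f3", "f4", "f5", "f6"]
labels = [0, 0, 0, 0, 0, 1]
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

In this toy case the single vulnerable sample is repeated until both classes contain five items. No new information is created, which is what distinguishes oversampling from the embedding-level augmentation methods studied in the paper.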

Paper Structure

This paper contains 17 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Blind augmentation of vulnerabilities using Linear Interpolation, where the augmented embedding is a weighted average of the two embeddings.
  • Figure 2: Conditioned augmentation of vulnerabilities using Linear Interpolation. The vulnerable line and the corresponding tokens are highlighted.
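
The two figure captions can be illustrated with a small sketch. It assumes each sample is a list of per-token embedding vectors of equal shape, that the mixing weight is a single scalar `lam`, and that the tokens of the vulnerable line are given as a boolean mask. These details, along with the function names, are assumptions for illustration; the captions only state that the blind variant takes a weighted average of two embeddings and that the conditioned variant leaves the vulnerable part of the representation unchanged.

```python
def linear_interpolation(emb_a, emb_b, lam=0.5):
    """Blind augmentation: per-token weighted average of two embeddings."""
    return [
        [lam * x + (1.0 - lam) * y for x, y in zip(tok_a, tok_b)]
        for tok_a, tok_b in zip(emb_a, emb_b)
    ]


def conditioned_interpolation(emb_a, emb_b, vulnerable_mask, lam=0.5):
    """Conditioned augmentation: interpolate as above, but keep the
    original vectors of emb_a wherever vulnerable_mask is True."""
    mixed = linear_interpolation(emb_a, emb_b, lam)
    return [
        list(tok_a) if is_vulnerable else tok_mixed
        for tok_a, tok_mixed, is_vulnerable in zip(emb_a, mixed, vulnerable_mask)
    ]


# Toy embeddings: three tokens, two dimensions each.
emb_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
emb_b = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
# Hypothetical mask: only the second token belongs to the vulnerable line.
mask = [False, True, False]

blind = linear_interpolation(emb_a, emb_b, lam=0.5)
conditioned = conditioned_interpolation(emb_a, emb_b, mask, lam=0.5)
```

With these inputs the blind result averages every token, while the conditioned result differs only at the masked position, where the original vector `[3.0, 4.0]` from `emb_a` is preserved.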