Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?

Triet H. M. Le; M. Ali Babar

TL;DR

Software vulnerability (SV) assessment based on CVSS metrics suffers from severe class imbalance. The authors run a large-scale, model-agnostic evaluation of nine data augmentation techniques on more than 180k SV descriptions to rebalance the data and improve prediction of seven CVSS metrics. They find that mitigating data imbalance improves the Matthews Correlation Coefficient (MCC) by up to 31.8%, and that simple text augmentation can outperform the no-augmentation baseline across tasks, with combinations of text edits performing best on average. The study provides practical baselines, releases code and models for reproducibility, and suggests directions for more robust SV prioritization pipelines in practice.

Abstract

Background: Software Vulnerability (SV) assessment is increasingly adopted to address the ever-increasing volume and complexity of SVs. Data-driven approaches have been widely used to automate SV assessment tasks, particularly the prediction of the Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from imbalanced distributions of the CVSS classes, but this data imbalance has hardly been understood or addressed in the literature. Aims: We conduct a large-scale study to quantify the impacts of data imbalance and mitigate the issue for SV assessment through the use of data augmentation. Method: We leverage nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without the augmented data. Results: Through extensive experiments on 180k+ real-world SVs, we show that mitigating data imbalance can significantly improve the predictive performance of models for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient (MCC). We also discover that simple text augmentation, such as combining random text insertion, deletion, and replacement, can outperform the baseline across the board. Conclusions: Our study provides the motivation and the first promising step toward tackling data imbalance for effective SV assessment.
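
To make the "random text insertion, deletion, and replacement" idea concrete, here is a minimal sketch of how such edits could be combined to oversample minority classes until every class matches the majority class size. It is an illustration under stated assumptions, not the authors' implementation: the function names, the edit probability `p`, and the choice to draw inserted and replacement words from the corpus vocabulary are all hypothetical.

```python
import random
from collections import Counter

def random_edit(tokens, vocab, rng, p=0.1):
    """Apply random deletion, replacement, and insertion to a token list.

    Assumption: inserted and replacement words are drawn uniformly from
    `vocab`; the paper's exact word sources are not given in this excerpt.
    """
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p:
            continue  # random deletion: drop this token
        if r < 2 * p:
            out.append(rng.choice(vocab))  # random replacement
        else:
            out.append(tok)  # keep the token unchanged
        if rng.random() < p:
            out.append(rng.choice(vocab))  # random insertion after this token
    return out or list(tokens)  # fall back to the original if all was deleted

def balance_by_augmentation(texts, labels, seed=0, p=0.1):
    """Add edited copies of minority-class texts until all classes are equal."""
    rng = random.Random(seed)
    vocab = sorted({tok for text in texts for tok in text.split()})
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class

    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    new_texts, new_labels = list(texts), list(labels)
    for label, items in by_class.items():
        for _ in range(target - counts[label]):
            source = rng.choice(items).split()
            new_texts.append(" ".join(random_edit(source, vocab, rng, p)))
            new_labels.append(label)
    return new_texts, new_labels
```

In a sketch like this, augmentation would normally be applied to the training split only, so that the test data keeps the real-world class distribution.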

Paper Structure (30 sections, 4 figures, 3 tables)

Introduction
Background and Motivation
CVSS-Based SV Assessment
Data Augmentation for SV Assessment with Data Imbalance
Case Study Design and Setup
Research Questions
Dataset
Studied Data Augmentation Techniques
Data Sampling
Simple Text Augmentation
Contextual Text Augmentation
Studied SV Assessment Models
Random Forest (RF) + TF-IDF model
RF + Doc2Vec model
Convolutional Neural Network (CNN) model
...and 15 more sections

Figures (4)

Figure 1: Data percentages (%) of the minority and the majority classes of the seven CVSS metrics of the SVs collected from the National Vulnerability Database, illustrating the data imbalance issue for SV assessment. Note: The percentages do not add up to 100% because each CVSS metric has three classes and only the minority and majority classes are shown.
Figure 2: Overview of the research methods used for the investigation of data augmentation for different SV assessment tasks.
Figure 3: Data class distributions of the seven CVSS metrics used for SV assessment. Note: The total number of the collected SVs is 180,087.
Figure 4: Percentage (%) differences in testing SV assessment performance (F1-Score and MCC) between using different data augmentation techniques and the baseline (without data augmentation) across different model types.
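
The MCC values and percentage differences referred to in Figure 4 can be illustrated with a small sketch. The multiclass MCC below uses the standard confusion-matrix generalization; the `percent_difference` helper assumes the difference is relative to the baseline score, which this excerpt does not confirm, and both function names are hypothetical.

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for two or more classes."""
    s = len(y_true)                                  # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correct predictions
    t_counts = Counter(y_true)                       # true count per class
    p_counts = Counter(y_pred)                       # predicted count per class
    classes = set(t_counts) | set(p_counts)

    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in classes)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())
    denom = math.sqrt(cov_tt * cov_pp)
    # By convention, MCC is 0 when the denominator vanishes, for example
    # when a model predicts a single class for every sample.
    return cov_tp / denom if denom else 0.0

def percent_difference(augmented, baseline):
    """Relative change (%) of an augmented score over the baseline score.

    Assumption: the change is measured relative to the baseline.
    """
    return 100.0 * (augmented - baseline) / baseline
```

A model that always predicts the majority class scores an MCC of 0 here regardless of how skewed the data is, which is one reason MCC is a common choice for imbalanced classification.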
