Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help?
Triet H. M. Le, M. Ali Babar
TL;DR
Software vulnerability (SV) assessment using CVSS metrics suffers from severe class imbalance. The authors conduct a large-scale, model-agnostic evaluation of nine data augmentation techniques on more than 180k SV descriptions, rebalancing the data to improve the prediction of seven CVSS metrics. They find that mitigating data imbalance yields gains of up to 31.8% in Matthews Correlation Coefficient (MCC) and that simple text augmentation can outperform the baseline across tasks, with combinations of text edits (random insertion, deletion, and replacement) performing best on average. The study provides practical baselines, releases code and models for reproducibility, and suggests directions for building more robust SV prioritization pipelines.
Abstract
Background: Software Vulnerability (SV) assessment is increasingly adopted to address the growing volume and complexity of SVs. Data-driven approaches have been widely used to automate SV assessment tasks, particularly the prediction of Common Vulnerability Scoring System (CVSS) metrics such as exploitability, impact, and severity. SV assessment suffers from imbalanced distributions of the CVSS classes, but this data imbalance has hardly been understood or addressed in the literature. Aims: We conduct a large-scale study to quantify the impact of data imbalance and to mitigate the issue for SV assessment through data augmentation. Method: We leverage nine data augmentation techniques to balance the class distributions of the CVSS metrics. We then compare the performance of SV assessment models with and without the augmented data. Results: Through extensive experiments on 180k+ real-world SVs, we show that mitigating data imbalance can significantly improve the predictive performance of models for all the CVSS tasks, by up to 31.8% in Matthews Correlation Coefficient. We also find that simple text augmentation, such as combining random text insertion, deletion, and replacement, can outperform the baseline across the board. Conclusions: Our study provides the motivation and the first promising step toward tackling data imbalance for effective SV assessment.
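To make the Method concrete, the following is a minimal Python sketch of balancing a CVSS class distribution with combined random text deletion, replacement, and insertion. It is an illustration under stated assumptions, not the authors' implementation: the function names `augment` and `balance`, the per-token edit probability `p`, sampling replacement and insertion tokens from the corpus vocabulary, and oversampling every class up to the majority-class size are all hypothetical choices that the abstract does not specify.

```python
import random
from collections import Counter


def augment(text, rng, vocab, p=0.1):
    """Return a noisy copy of `text` using random token edits.

    Assumption: each token is independently deleted, replaced, or
    followed by an inserted token, each with probability `p`.
    """
    out = []
    for tok in text.split():
        r = rng.random()
        if r < p:
            # Random deletion: drop the token.
            continue
        if r < 2 * p:
            # Random replacement: swap in a token from the vocabulary.
            out.append(rng.choice(vocab))
            continue
        out.append(tok)
        if r < 3 * p:
            # Random insertion: add a vocabulary token after this one.
            out.append(rng.choice(vocab))
    # Fall back to the original text if every token was deleted.
    return " ".join(out) if out else text


def balance(texts, labels, seed=0):
    """Oversample minority classes with augmented copies.

    Assumption: every class is grown to the size of the largest class.
    Original samples are kept unchanged at the front of the output.
    """
    rng = random.Random(seed)
    vocab = sorted({tok for t in texts for tok in t.split()})
    counts = Counter(labels)
    target = max(counts.values())

    by_class = {}
    for t, y in zip(texts, labels):
        by_class.setdefault(y, []).append(t)

    new_texts, new_labels = list(texts), list(labels)
    for y, group in by_class.items():
        for _ in range(target - counts[y]):
            new_texts.append(augment(rng.choice(group), rng, vocab))
            new_labels.append(y)
    return new_texts, new_labels
```

A sketch like this would normally be applied to the training split only, so that the evaluation data keeps its original, imbalanced class distribution.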
