An Empirical Study of the Imbalance Issue in Software Vulnerability Detection

Yuejun Guo, Qiang Hu, Qiang Tang, Yves Le Traon

TL;DR

This study empirically investigates the imbalance issue in deep learning (DL)-based software vulnerability detection. It shows that standard training is biased toward secure code, which degrades detection of vulnerable code across nine datasets and two foundation models. It evaluates seven imbalance remedies from data- and model-level perspectives and finds that no single method excels across all metrics: focal loss improves precision, mean false error (MFE) and class-balanced (CB) loss aid recall, and random over-sampling often boosts F1. The work emphasizes that evaluation should rely on precision, recall, and F1 rather than on accuracy or false positive rate (FPR) alone, and shows that external factors such as vulnerability-type distribution and data shift critically affect remedy effectiveness. The findings underscore the need for a task-specific imbalance solution for vulnerability detection and provide practical guidelines for metric selection and remedy design in real-world settings.
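The point about metric choice can be illustrated with a minimal sketch. The toy label distribution below (95 secure samples, 5 vulnerable ones) and the helper name `binary_metrics` are assumptions made for illustration; they are not taken from the paper's datasets or code. A degenerate model that labels everything as secure scores well on accuracy and FPR while detecting no vulnerable code at all.

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary metrics, treating label 1 (vulnerable) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against empty denominators: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical imbalanced test set: 95 secure (0) and 5 vulnerable (1) samples.
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that predicts every sample as secure.
always_secure = [0] * 100

trivial = binary_metrics(y_true, always_secure)
for name, value in trivial.items():
    print(f"{name}: {value:.2f}")
```

In this toy setting, accuracy and FPR look favourable even though recall and F1 are zero, which is why precision, recall, and F1 are the more informative metrics under imbalance.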

Abstract

Vulnerability detection is crucial for protecting software security. Nowadays, deep learning (DL) is the most promising technique to automate this detection task, leveraging its superior ability to extract patterns and representations from large volumes of code. Despite its promise, DL-based vulnerability detection remains in its early stages, with model performance varying across datasets. Drawing insights from other well-explored application areas such as computer vision, we conjecture that the imbalance issue (the number of vulnerable code samples is extremely small) is at the core of this phenomenon. To validate this, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection. It turns out that these solutions also perform differently across datasets and evaluation metrics. Specifically: 1) focal loss is better suited to improving precision, 2) mean false error and class-balanced loss encourage recall, and 3) random over-sampling benefits the F1-measure. However, none of them excels across all metrics. To delve deeper, we explore external influences on these solutions and offer insights for developing new solutions.
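Two of the remedies named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the focal loss follows the commonly used binary form $-\alpha_t (1 - p_t)^{\gamma} \log p_t$, the defaults `gamma=2.0` and `alpha=0.25` are conventional values rather than the settings used in the study, and the function names and toy samples are hypothetical.

```python
import math
import random


def cross_entropy(p, y):
    """Binary cross-entropy for one sample; p is the predicted P(vulnerable)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample (gamma and alpha are assumed defaults).

    The factor (1 - p_t) ** gamma down-weights samples the model already
    classifies confidently, so training focuses on hard samples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# An easy sample (p_t = 0.9) contributes far less under focal loss than a hard one (p_t = 0.1).
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)

# Hypothetical toy data: 8 secure functions and 2 vulnerable ones.
samples = [f"func_{i}" for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

Random over-sampling is a data-level remedy applied to the training set only, whereas focal loss is a model-level remedy that changes the training objective.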

Paper Structure

This paper contains 20 sections, 7 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An example of vulnerable code: a C-language function from the LibTIFF project libtiffweb. This function is tagged with the "Denial of Service" vulnerability in the Lin2018 dataset (please refer to Section \ref{subsec:data} for more details). The code framed in a red rectangle highlights a concern about handling cases of division by zero when $m.i[1]==0$.
  • Figure 2: Overview of the empirical study. During training (top), a foundation DL model is fine-tuned to minimize the error between predicted and ground-truth labels. At test time (bottom), the trained model is used to make predictions on test data. ❶❷❸❹ refer to the four research questions.
  • Figure 3: Vulnerability type distribution in each split set (training, validation, and test). $x$-axis: vulnerability type ID (please refer to Table \ref{tab:vulist} for more details). $y$-axis: number of samples in the corresponding set. Source: Lin2018.