Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?

Triet H. M. Le, M. Ali Babar

TL;DR

The paper investigates the quality and utility of auto-labeled software vulnerability (SV) data produced by D2A, compared with traditional human-labeled data derived from vulnerability-fixing commits (VFCs). The authors curate clean human-labeled datasets for OpenSSL and FFmpeg, then train file-level SV prediction models with diverse code features and ML classifiers to quantify the noise in auto-labeled SVs and its effect on predictive performance, including the role of noise-reduction techniques such as Confident Learning. Key findings: more than half of the auto-labeled SVs are noisy (incorrectly labeled) and they hardly overlap with publicly reported SVs, yet models that use the auto-labeled SVs can perform up to 22% better in Matthews Correlation Coefficient (MCC) and up to 90% better in Recall than models trained only on human-labeled data. Noise-reduction methods improve robustness and can maintain strong performance with fewer auto-labeled samples, though care is needed to avoid discarding true vulnerabilities. The study offers evidence-based guidance on using auto-labeled SV data to scale SV prediction while highlighting avenues for improving labeling quality and noise handling in practice.

Abstract

Background: Software Vulnerability (SV) prediction needs large-sized and high-quality data to perform well. Current SV datasets mostly require expensive labeling efforts by experts (human-labeled) and thus are limited in size. Meanwhile, there are growing efforts in automatic SV labeling at scale. However, the fitness of auto-labeled data for SV prediction is still largely unknown. Aims: We quantitatively and qualitatively study the quality and use of the state-of-the-art auto-labeled SV data, D2A, for SV prediction. Method: Using multiple sources and manual validation, we curate clean SV data from human-labeled SV-fixing commits in two well-known projects for investigating the auto-labeled counterparts. Results: We discover that 50+% of the auto-labeled SVs are noisy (incorrectly labeled), and they hardly overlap with the publicly reported ones. Yet, SV prediction models utilizing the noisy auto-labeled SVs can perform up to 22% and 90% better in Matthews Correlation Coefficient and Recall, respectively, than the original models. We also reveal the promises and difficulties of applying noise-reduction methods for automatically addressing the noise in auto-labeled SV data to maximize the data utilization for SV prediction. Conclusions: Our study informs the benefits and challenges of using auto-labeled SVs, paving the way for large-scale SV prediction.
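For reference, the two measures named in the abstract can be computed from a binary confusion matrix as in the minimal sketch below. The function names and the example counts are illustrative assumptions, not values from the paper; returning 0.0 when the denominator is zero is a common convention rather than something the paper specifies.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of truly vulnerable files that the model flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for an imbalanced test split (not from the paper).
example_mcc = mcc(tp=30, fp=20, tn=930, fn=20)
example_recall = recall(tp=30, fn=20)
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when vulnerable files are a small minority of the data.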

Paper Structure

This paper contains 22 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Exemplary vulnerable file corresponding to CVE-2020-1967 extracted from the respective vulnerability-fixing commit in the OpenSSL project.
  • Figure 2: Research methods for answering the three research questions. Note: Non-VFCs are the commits not fixing SVs.
  • Figure 3: The 5-round training, validation, & testing file-level SV prediction models. Notes: The splits are of equal size. Any index exceeding five would be wrapped around (e.g., 6%5 = 1).
  • Figure 4: The relationship between the auto-labeled (D2A) SVs & human-labeled SVs. Notes: (*) indicates that the results were obtained from a subset of 68 samples from each of the OpenSSL & FFmpeg projects. The overlapping VFCs were only from the FFmpeg project.
  • Figure 5: Testing performance of file-level SV prediction models using the three data types. Notes: The blue vertical dashed lines are the median performance values of the models using only human-labeled SVs. MCC is the main evaluation measure.
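The wrap-around indexing mentioned in the Figure 3 caption can be sketched as follows. The caption only states that there are five equal-sized splits and that indices past five wrap around; the allocation used here (three consecutive splits for training, the next for validation, the next for testing) is an assumption for illustration, and `wrap` and `round_assignment` are hypothetical helper names.

```python
from typing import Dict, List

N_SPLITS = 5  # five equal-sized splits, per the Figure 3 caption


def wrap(index: int, n: int = N_SPLITS) -> int:
    """Map a 1-based index onto 1..n, so that n + 1 wraps around to 1."""
    return (index - 1) % n + 1


def round_assignment(round_no: int, n_train: int = 3) -> Dict[str, List[int]]:
    """Assign split indices for one round (allocation is an assumption)."""
    train = [wrap(round_no + k) for k in range(n_train)]
    val = [wrap(round_no + n_train)]
    test = [wrap(round_no + n_train + 1)]
    return {"train": train, "val": val, "test": test}


# One assignment per round; every split serves as the test split exactly once.
rounds = {r: round_assignment(r) for r in range(1, N_SPLITS + 1)}
```

Under this assumed allocation, round 2 trains on splits 2 to 4, validates on split 5, and tests on split 1, which is the wrapped index 6.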