Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study

Triet H. M. Le; Xiaoning Du; M. Ali Babar

TL;DR

The paper investigates latent software vulnerabilities (SVs): vulnerable versions of code that exist between the commit that introduces a vulnerability and the commit that fixes it, and that current SV datasets overlook. By adapting the state-of-the-art V-SZZ method, the authors identify more than 100k latent vulnerable functions in Big-Vul and Devign. These latent SVs increase the number of SVs by about 4x on average and correct up to 5k mislabeled functions, at a noise level of around 6%. Using LineVul, the authors show that latent SVs improve function-level SV prediction by up to 24.5% in F1-Score and line-level vulnerability localization by up to 67%, while also enabling recall of latent SVs not captured by SZZ alone. The results support latent SVs as a practical data-augmentation strategy, particularly valuable for low-resource projects, and point to ways of improving SV dataset quality and downstream predictive performance. Data and code are publicly available for replication.

Abstract

Collecting relevant and high-quality data is integral to the development of effective Software Vulnerability (SV) prediction models. Most of the current SV datasets rely on SV-fixing commits to extract vulnerable functions and lines. However, none of these datasets have considered latent SVs existing between the introduction and fix of the collected SVs. There is also little known about the usefulness of these latent SVs for SV prediction. To bridge these gaps, we conduct a large-scale study on the latent vulnerable functions in two commonly used SV datasets and their utilization for function-level and line-level SV predictions. Leveraging the state-of-the-art SZZ algorithm, we identify more than 100k latent vulnerable functions in the studied datasets. We find that these latent functions can increase the number of SVs by 4x on average and correct up to 5k mislabeled functions, yet they have a noise level of around 6%. Despite the noise, we show that the state-of-the-art SV prediction model can significantly benefit from such latent SVs. The improvements are up to 24.5% in the performance (F1-Score) of function-level SV predictions and up to 67% in the effectiveness of localizing vulnerable lines. Overall, our study presents the first promising step toward the use of latent SVs to improve the quality of SV datasets and enhance the performance of SV prediction tasks.
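The labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the vulnerability-introducing and vulnerability-fixing commits of a function are already known (in the paper they are identified with an SZZ-based method), and the names `FunctionVersion`, `latent_window`, and `mislabeled` are hypothetical. Every version of the function from the introducing commit up to, but not including, the fixing commit is treated as vulnerable, and versions the original dataset labeled non-vulnerable are flagged as candidates for relabeling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FunctionVersion:
    commit: str         # hypothetical commit identifier
    dataset_label: int  # label in the original dataset: 1 = vulnerable, 0 = non-vulnerable


def latent_window(history: List[FunctionVersion],
                  introducing: str,
                  fixing: str) -> List[FunctionVersion]:
    """Return the versions from the introducing commit (inclusive)
    up to the fixing commit (exclusive), in chronological order."""
    commits = [v.commit for v in history]
    start = commits.index(introducing)
    end = commits.index(fixing)
    if start >= end:
        raise ValueError("introducing commit must precede fixing commit")
    return history[start:end]


def mislabeled(window: List[FunctionVersion]) -> List[FunctionVersion]:
    """Versions inside the vulnerable window that the dataset marked non-vulnerable."""
    return [v for v in window if v.dataset_label == 0]


# Toy history of one function, oldest commit first (illustrative data only).
history = [
    FunctionVersion("c0", 0),  # before the vulnerability exists
    FunctionVersion("c1", 0),  # vulnerability introduced here
    FunctionVersion("c2", 0),  # refactoring only, still vulnerable
    FunctionVersion("c3", 1),  # version collected from the fixing commit's parent
    FunctionVersion("c4", 0),  # fixed version
]

window = latent_window(history, introducing="c1", fixing="c4")
print([v.commit for v in window])
print([v.commit for v in mislabeled(window)])
```

In this toy history, the versions at `c1` and `c2` are the latent ones: they sit inside the vulnerable window, but a dataset built only from the fixing commit would never label them as vulnerable.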

Paper Structure (28 sections, 4 figures, 3 tables)

Introduction
Background and Motivation
SV Data Labeling in Current Datasets
Missing Consideration of Latent SVs
Research Questions
Data Collection
Studied SV Datasets
SZZ-Based Identification of Latent SVs
Quality and Prevalence of Latent SVs in Existing Datasets (RQ1)
Quality of Latent SVs
Methods of Manual Validation
Results of Manual Validation
Prevalence of Latent SVs
Methods
Results
...and 13 more sections

Figures (4)

Figure 1: Procedure of current datasets for collecting vulnerable functions/lines from vulnerability-fixing commits.
Figure 2: An exemplary latent vulnerable function existing between its introduction and fixing commits in the FFmpeg project. Note: This latent vulnerable function was originally labeled as non-vulnerable, as it belonged to a non-vulnerability-fixing commit of the Devign dataset (Zhou et al., 2019). Its name was changed in one of the intermediate commits.
Figure 3: An example of using V-SZZ to identify the Vulnerability-Introducing Commit from a Vulnerability-Fixing Commit with intermediate commits performing only refactoring. Notes: The arrows show the tracking of the original vulnerable line for (i = 0; i < name_len; i++) across commits. The commits were taken from the FFmpeg project of the Devign dataset.
Figure 4: Workflow of evaluating the impacts of latent vulnerable functions on function-level and line-level SV predictions. Note: SVP stands for SV prediction.
