Are Latent Vulnerabilities Hidden Gems for Software Vulnerability Prediction? An Empirical Study
Triet H. M. Le, Xiaoning Du, M. Ali Babar
TL;DR
The paper investigates latent software vulnerabilities (SVs): vulnerable code that exists between the introduction of a vulnerability and its fix, which current SV datasets overlook. By adapting the state-of-the-art V-SZZ method, the authors identify more than 100k latent vulnerable functions in Big-Vul and Devign. These latent SVs increase the number of SVs by about 4x on average and correct up to 5k mislabeled functions, at a noise level of around 6%. Using LineVul, the authors show that latent SVs improve function-level SV prediction by up to 24.5% in F1-Score and line-level vulnerability localization by up to 67%, while also enabling recall of latent SVs not captured by SZZ alone. The results support latent SVs as a practical data-augmentation strategy, particularly valuable for low-resource projects, and point to ways of improving SV dataset quality and downstream predictive performance. Data and code are publicly available for replication.
Abstract
Collecting relevant and high-quality data is integral to the development of effective Software Vulnerability (SV) prediction models. Most of the current SV datasets rely on SV-fixing commits to extract vulnerable functions and lines. However, none of these datasets have considered latent SVs existing between the introduction and fix of the collected SVs. There is also little known about the usefulness of these latent SVs for SV prediction. To bridge these gaps, we conduct a large-scale study on the latent vulnerable functions in two commonly used SV datasets and their utilization for function-level and line-level SV predictions. Leveraging the state-of-the-art SZZ algorithm, we identify more than 100k latent vulnerable functions in the studied datasets. We find that these latent functions can increase the number of SVs by 4x on average and correct up to 5k mislabeled functions, yet they have a noise level of around 6%. Despite the noise, we show that the state-of-the-art SV prediction model can significantly benefit from such latent SVs. The improvements are up to 24.5% in the performance (F1-Score) of function-level SV predictions and up to 67% in the effectiveness of localizing vulnerable lines. Overall, our study presents the first promising step toward the use of latent SVs to improve the quality of SV datasets and enhance the performance of SV prediction tasks.
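To make the notion of a latent vulnerable function more concrete, the following is a minimal sketch of one plausible reading of the abstract, not the authors' implementation. It assumes that the vulnerability-introducing commit has already been identified (for example by an SZZ-style tool) and that the history of a single function is available as an ordered list of versions. The names `FunctionVersion` and `split_vulnerable_versions` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FunctionVersion:
    """One version of a function (hypothetical structure for illustration)."""
    commit: str  # identifier of the commit that produced this version
    code: str    # function body as of that commit


def split_vulnerable_versions(
    history: List[FunctionVersion],
    introducing_commit: str,
    fixing_commit: str,
) -> Tuple[List[FunctionVersion], FunctionVersion]:
    """Split the vulnerable versions of a function into latent and pre-fix.

    `history` is assumed to be ordered from oldest to newest and to contain
    both the introducing and the fixing commit. Versions from the introducing
    commit up to (but excluding) the fixing commit are treated as vulnerable.
    The last of these is the version just before the fix, which fix-based
    datasets already collect; the earlier ones are treated as latent here.
    """
    commits = [version.commit for version in history]
    start = commits.index(introducing_commit)
    end = commits.index(fixing_commit)
    if start >= end:
        raise ValueError("introducing commit must precede the fixing commit")

    vulnerable = history[start:end]
    return vulnerable[:-1], vulnerable[-1]


# Toy history: the vulnerability is introduced in c2 and fixed in c5.
history = [FunctionVersion(f"c{i}", f"body v{i}") for i in range(1, 6)]
latent, pre_fix = split_vulnerable_versions(history, "c2", "c5")
latent_commits = [version.commit for version in latent]
```

In this toy history, the versions at `c2` and `c3` are the extra samples that a dataset built only from the fixing commit would miss, while `c4` is the version such a dataset already contains.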
