Leveraging Self-Paced Learning for Software Vulnerability Detection

Zeru Cheng, Yanjing Yang, He Zhang, Lanxin Yang, Jinghao Hu, Jinwei Xu, Bohan Liu, Haifeng Shen

TL;DR

This work tackles software vulnerability detection under the pervasive challenge of low-quality training data. It introduces SPLVD, a self-paced, curriculum-inspired framework that dynamically selects training source code by estimated difficulty and evolving training age, built atop UniXcoder with an LSTM classifier. Empirical results on BigVul, Devign, ReVeal, and OpenHarmony show SPLVD achieving top F1 and MCC, with ablations confirming the value of self-paced learning and improved high-confidence detections. The approach demonstrates strong practical potential by delivering high-precision alerts suitable for real-world vulnerability review while acknowledging trade-offs in training time and the need for manual verification. Overall, SPLVD advances vulnerability detection by aligning learning with data quality and task difficulty, enabling more reliable production deployments.

Abstract

Software vulnerabilities pose major risks to software systems. Researchers have recently proposed many deep learning approaches to detect them, but the accuracy of these approaches is limited in practice. One of the main causes is low-quality training data (i.e., source code). To address this, we propose SPLVD (Self-Paced Learning for Software Vulnerability Detection). SPLVD dynamically selects source code for model training based on the stage of training, simulating the human learning process of progressing from easy to hard. Its data selector is designed specifically for the vulnerability detection task and enables SPLVD to prioritize learning from easy source code. Before each training epoch, SPLVD uses the data selector to recalculate the difficulty of the source code, select new training source code, and update the data selector. We first evaluate SPLVD on three benchmark datasets containing over 239K source code samples, of which 25K are vulnerable. Experimental results show that SPLVD achieves the highest F1 scores of 89.2%, 68.7%, and 43.5% on the three datasets, respectively, outperforming state-of-the-art approaches. We then collect projects from OpenHarmony, a new ecosystem that general LLMs have not yet learned, to evaluate SPLVD further. On these projects, SPLVD achieves the highest precision of 90.9%, demonstrating its practical effectiveness.
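The per-epoch loop described in the abstract (recalculate difficulty, select training samples, update the selector) can be illustrated with a generic self-paced learning sketch. This is not SPLVD's actual data selector, whose design is not given here. As assumptions for illustration only, difficulty is a per-sample loss value, selection is a hard threshold, and the threshold grows by a fixed factor each epoch. The names `select_easy_samples`, `self_paced_schedule`, `initial_threshold`, and `growth` are hypothetical.

```python
from typing import Callable, List, Sequence


def select_easy_samples(difficulties: Sequence[float], threshold: float) -> List[int]:
    """Return indices of samples whose difficulty is at most the threshold."""
    return [i for i, d in enumerate(difficulties) if d <= threshold]


def self_paced_schedule(
    difficulty_fn: Callable[[int], Sequence[float]],
    epochs: int,
    initial_threshold: float,
    growth: float,
) -> List[List[int]]:
    """Return the sample indices selected before each epoch.

    difficulty_fn(epoch) is a stand-in (assumption) for re-scoring every
    sample with the current model. A real pipeline would also train the
    model on the selected subset inside the loop.
    """
    threshold = initial_threshold
    history: List[List[int]] = []
    for epoch in range(epochs):
        # Step 1: recalculate the difficulty of every sample.
        difficulties = difficulty_fn(epoch)
        # Step 2: select the samples that currently count as "easy".
        selected = select_easy_samples(difficulties, threshold)
        history.append(selected)
        # (Training on `selected` would happen here.)
        # Step 3: update the selector so harder samples are admitted later.
        threshold *= growth
    return history


# Toy per-sample losses, held fixed across epochs purely for illustration.
toy_losses = [0.1, 0.9, 0.4, 2.0, 0.25]
history = self_paced_schedule(
    lambda epoch: toy_losses, epochs=3, initial_threshold=0.3, growth=2.0
)
for epoch, selected in enumerate(history):
    print(f"epoch {epoch}: train on samples {selected}")
```

With these toy values the selected set grows from the two easiest samples to four of the five, while the hardest sample (loss 2.0) is never admitted. A hard threshold of this kind is one way low-quality or mislabeled samples could be kept out of early training, though the selection rule the paper uses may differ.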

Paper Structure

This paper contains 31 sections, 5 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: The Unrelated Code Is Wrongly Labeled
  • Figure 2: The Non-Modified Code Is Marked as Vulnerable
  • Figure 3: An Illustration of Self-Paced Learning for Model Training
  • Figure 4: An Overview of the Proposed SPLVD
  • Figure 5: SPLVD Measures the Distribution of Difficulty for Vulnerable Source Code on Three Datasets
  • ...and 3 more figures