Deep Learning-Based Out-of-distribution Source Code Data Identification: How Far Have We Gone?

Van Nguyen, Xingliang Yuan, Tingmin Wu, Surya Nepal, Marthie Grobler, Carsten Rudolph

TL;DR

This work tackles the challenge of detecting out-of-distribution (OOD) source code data in software vulnerability detection, a problem that undermines reliability when models encounter unseen CWE categories. It introduces LEO, a deep learning framework that learns code characteristics through an important-statement selector guided by information-theoretic objectives and enhances representation learning with cluster-contrastive learning. LEO uses a two-phase process: during training, it selects vulnerable-pattern-relevant statements and aligns latent representations via mutual information and cluster-based constraints; during inference, it computes a cluster-conditioned Mahalanobis outlier score to identify OOD samples. Evaluated on the DiverseVul dataset against strong baselines, LEO substantially improves FPR at 95% TPR, AUROC, and AUPR across multiple ID/OOD CWE configurations, demonstrating robust OOD detection and promising practical impact for securing software systems.
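The inference step described above can be illustrated with a minimal NumPy sketch of a cluster-conditioned Mahalanobis outlier score. This is not the authors' implementation: the function names `fit_cluster_stats` and `outlier_score`, the use of a separate covariance per cluster, the small ridge term, and the synthetic embeddings are all assumptions made for illustration. Cluster labels are taken as given here; in practice they would come from clustering the learned in-distribution representations.

```python
import numpy as np

def fit_cluster_stats(z, labels, ridge=1e-6):
    """Estimate a mean and an inverse covariance for each cluster of ID embeddings.

    z: (n, d) array of in-distribution embeddings (hypothetical encoder output).
    labels: (n,) array of cluster assignments (assumed to be given).
    """
    stats = []
    for k in np.unique(labels):
        zk = z[labels == k]
        mean = zk.mean(axis=0)
        # A small ridge keeps the covariance invertible (an assumption of this sketch).
        cov = np.cov(zk, rowvar=False) + ridge * np.eye(z.shape[1])
        stats.append((mean, np.linalg.inv(cov)))
    return stats

def outlier_score(z, stats):
    """Squared Mahalanobis distance to the closest cluster; larger means more OOD-like."""
    dists = []
    for mean, precision in stats:
        diff = z - mean
        dists.append(np.einsum("nd,de,ne->n", diff, precision, diff))
    return np.min(np.stack(dists, axis=1), axis=1)

# Synthetic stand-in for learned representations: two ID clusters and a distant OOD group.
rng = np.random.default_rng(0)
z_id = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 4)),
    rng.normal(loc=6.0, scale=1.0, size=(200, 4)),
])
labels = np.array([0] * 200 + [1] * 200)
z_ood = rng.normal(loc=-8.0, scale=1.0, size=(50, 4))

stats = fit_cluster_stats(z_id, labels)
id_scores = outlier_score(z_id, stats)
ood_scores = outlier_score(z_ood, stats)
```

Taking the minimum over clusters means a sample only needs to be close to one in-distribution cluster to receive a low score, which is why well-separated clusters in the latent space matter for this kind of detector.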

Abstract

Software vulnerabilities (SVs) have become a common and serious concern for safety-critical security systems. This has led to significant progress in the use of AI-based methods for software vulnerability detection (SVD). In practice, although AI-based methods have achieved promising performance in SVD and other domains (e.g., computer vision), they are well known to fail to predict the ground-truth label of input data lying far from the training data distribution (i.e., the in-distribution, ID); such inputs are referred to as out-of-distribution (OOD) data. This drawback leads to serious issues because the models fail to indicate when they are likely mistaken. To address this problem, OOD detectors (which determine whether an input is ID or OOD) are applied before the input data is fed to the downstream AI-based modules. While OOD detection has been widely studied for computer vision and medical diagnosis applications, automated AI-based techniques for OOD source code data detection have not yet been well explored. To this end, in this paper, we propose an innovative deep learning-based approach addressing the OOD source code data identification problem. Our method is derived from an information-theoretic perspective and uses innovative cluster-contrastive learning to effectively learn and leverage source code characteristics, enhancing data representation learning for solving the problem. Rigorous and comprehensive experiments on real-world source code datasets show that our approach outperforms state-of-the-art baselines by a wide margin. In short, on average, our method improves on the baselines by around 15.27%, 7.39%, and 4.93% on the FPR, AUROC, and AUPR measures, respectively.

Paper Structure

This paper contains 44 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: A visualization of our proposed LEO method for effectively improving the source code data representation learning process to solve the out-of-distribution (OOD) source code data identification problem.
  • Figure 2: A visualization of the joint training process of the classifier and the selection model, guided by maximizing the mutual information between $\tilde{X}$ and $Y$.
  • Figure 3: Two examples of vulnerability patterns causing the buffer overflow error. The left-hand function shows an example of a buffer copy without checking the size of the input. The right-hand function shows an example of improper validation of an array index: it only verifies the array index against the maximum length, not the minimum value.
  • Figure 4: A 2D t-SNE projection for the data representation distribution of the in-distribution data (blue color) and the out-of-distribution vulnerable data (red color) in the latent space (i.e., where source code data from CWE863 and CWE287 categories are used as in-distribution data and out-of-distribution data, respectively) of our proposed LEO method and the baselines.
  • Figure 5: The average results for the FPR (at TPR 95%), AUROC, and AUPR measures, respectively, of our LEO method and the baselines in all cases of the in-distribution and out-of-distribution CWE categories mentioned in Tables \ref{tab:my_label1t} and \ref{tab:my_label23t}. Note that for the FPR measure, a smaller value is better, while for the AUROC and AUPR measures, a higher value is better. We denote the Standard DNN, Outlier Exposure, VulDeePecker, and CodeBERT methods as SDNN, OE, VDP, and CBERT for short.
  • ...and 1 more figure
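The three measures named in the Figure 5 caption can be computed from detector scores as in the following NumPy sketch. It assumes that OOD samples are the positive class and that a higher score means "more OOD-like"; the paper may use a different convention. The function names `fpr_at_tpr`, `auroc`, and `aupr` are hypothetical, and AUPR is approximated here by average precision without special handling of tied scores.

```python
import math
import numpy as np

def fpr_at_tpr(pos, neg, tpr=0.95):
    """False positive rate at the loosest threshold that still flags >= `tpr` of positives."""
    pos_desc = np.sort(pos)[::-1]
    k = math.ceil(tpr * len(pos))      # number of positives that must be flagged
    threshold = pos_desc[k - 1]
    return float(np.mean(neg >= threshold))

def auroc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def aupr(pos, neg):
    """Average precision: mean of the precision values at the rank of each positive."""
    scores = np.concatenate([pos, neg])
    is_pos = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    order = np.argsort(-scores, kind="stable")
    hits = is_pos[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision * hits).sum() / hits.sum())

# Toy scores: OOD samples (positives) score strictly higher than ID samples (negatives).
ood = np.array([3.0, 4.0, 5.0])
ind = np.array([0.0, 1.0, 2.0])
print(fpr_at_tpr(ood, ind), auroc(ood, ind), aupr(ood, ind))
```

With perfectly separated scores, FPR at 95% TPR is 0 and both AUROC and AUPR are 1, which matches the caption's note that lower is better for FPR and higher is better for the other two measures.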