An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph

Nafis Tanveer Islam, Gonzalo De La Torre Parra, Dylan Manuel, Elias Bou-Harb, Peyman Najafirad

TL;DR

This work tackles vulnerability detection in open-source code under real-world distribution shifts and long-range dependencies. It introduces Semantic Vulnerability Graphs (SVG) that fuse sequential, data, control, and novel Poacher Flow edges to capture rich program semantics, and combines RoBERTa embeddings with Graph Convolutional Networks in a multitask RoBERTa-PFGCN framework trained with Focal Loss to handle data imbalance. The authors present the VulF dataset alongside other real-world datasets, demonstrating improved accuracy and reduced false positives/negatives, including strong performance on N-day and zero-day samples (e.g., 93% accuracy on 273 N-day cases and correct zero-day detections). The approach provides not only vulnerability detection but also CWE descriptions, aiding developers in remediation and prioritization with practical impact for scalable software security.

Abstract

Over the years, open-source software systems have become prey to threat actors. Even though open-source communities act quickly to patch breaches, code vulnerability screening should be an integral part of agile software development from the beginning. Unfortunately, current vulnerability screening techniques are ineffective at identifying novel vulnerabilities or at providing developers with both vulnerability detection and classification. Furthermore, the datasets used for vulnerability learning often exhibit distribution shifts from the real-world testing distribution due to novel attack strategies deployed by adversaries; as a result, the machine learning model's performance may be hindered or biased. To address these issues, we propose a joint interpolated multitasked unbiased vulnerability classifier comprising a transformer, RoBERTa, and a graph convolutional neural network (GCN). We present a training process utilizing a semantic vulnerability graph (SVG) representation of source code, created by integrating edges from sequential flow, control flow, and data flow, as well as a novel flow dubbed Poacher Flow (PF). Poacher Flow edges reduce the gap between dynamic and static program analysis and handle complex long-range dependencies. Moreover, our approach reduces the bias of classifiers trained on unbalanced datasets by integrating the Focal Loss objective function with the SVG. Experimental results show that our classifier outperforms state-of-the-art results on vulnerability detection with fewer false negatives and false positives. Tested across multiple datasets, our model shows an improvement of at least 2.41%, and of 18.75% in the best-case scenario. Evaluations using N-day program samples demonstrate that our proposed approach achieves 93% accuracy, and it was able to detect 4 zero-day vulnerabilities in popular GitHub repositories.
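
To make the role of Focal Loss more concrete, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), for a single sample. It is not the authors' implementation: the function names and the values gamma=2.0 and alpha=0.25 are illustrative assumptions and are not taken from the paper.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail of this sketch


def cross_entropy(p_vulnerable, label):
    """Plain binary cross-entropy for one sample (label is 0 or 1)."""
    p_t = p_vulnerable if label == 1 else 1.0 - p_vulnerable
    p_t = min(max(p_t, EPS), 1.0 - EPS)
    return -math.log(p_t)


def focal_loss(p_vulnerable, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample.

    p_vulnerable: predicted probability that the function is vulnerable.
    label: 1 for vulnerable, 0 for non-vulnerable.
    gamma, alpha: illustrative values, NOT the paper's hyperparameters.
    """
    # p_t is the probability the model assigns to the true class.
    p_t = p_vulnerable if label == 1 else 1.0 - p_vulnerable
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, EPS), 1.0 - EPS)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples,
    # so hard (often minority-class) samples dominate the total loss.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction versus a confidently wrong one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
print(f"easy sample loss: {easy:.6f}")
print(f"hard sample loss: {hard:.6f}")
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is a convenient sanity check when experimenting with the hyperparameters.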

Paper Structure (40 sections, 14 equations, 8 figures, 8 tables, 1 algorithm)

Figures (8)

  • Figure 1: Example of a vulnerability explanation as part of an automated code review process to help developers effectively resolve static software security issues.
  • Figure 2: Overall architecture of our classifier, which is divided into three parts. First, the input source code is pre-processed by creating an SVG. Then, the RoBERTa layer generates an embedding for each token/node of the graph. Finally, the GCN layer takes the node embeddings and the adjacency matrix for feature generation. The MLP layer decides whether a function is vulnerable, and the Focal Loss function forces the model to learn more about the minority class.
  • Figure 3: Depiction of our SVG. Each gray box shows individual tokens of our SVG. The red line depicts a poacher flow edge, the black line depicts data flow edges, the blue line depicts control flow edges and the gray line depicts sequential flow edges.
  • Figure 4: CWE class-by-class F1 score comparison between our proposed model (RoBERTa-PFGCN) and $\mu$VulDeePecker on the MVD dataset provided by $\mu$VulDeePecker, which includes 40 CWE classes. The blue bar corresponds to RoBERTa-PFGCN, while the orange bar represents $\mu$VulDeePecker.
  • Figure 5: Example code for CWE-190, which our classifier predicted accurately. The red edge shows a Poacher Flow edge that captures the data processing of the code. Hence, our classifier was able to detect the vulnerability and provide a description.
  • ...and 3 more figures
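
The captions for Figures 2 and 3 describe a graph whose nodes are tokens and whose edges come from four flow types, passed to a GCN as an adjacency matrix together with node embeddings. The sketch below illustrates that data layout under several assumptions: the token list and every edge are hand-written placeholders (how Poacher Flow edges are derived is not described in this excerpt), edges are treated as undirected, the two-dimensional features stand in for RoBERTa embeddings, and the propagation step uses the common symmetric normalization D^(-1/2) (A + I) D^(-1/2) X, which may differ from the paper's exact formulation.

```python
import math


def build_adjacency(num_nodes, edges_by_type):
    """Merge all edge types into one undirected adjacency matrix (assumption)."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for edges in edges_by_type.values():
        for src, dst in edges:
            adj[src][dst] = 1.0
            adj[dst][src] = 1.0
    return adj


def normalize_with_self_loops(adj):
    """Return D^(-1/2) (A + I) D^(-1/2), a common GCN normalization."""
    n = len(adj)
    a_hat = [[1.0 if i == j else adj[i][j] for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagate(norm_adj, features):
    """One neighborhood-averaging step: norm_adj @ features (no weights)."""
    n, dim = len(norm_adj), len(features[0])
    return [[sum(norm_adj[i][k] * features[k][c] for k in range(n))
             for c in range(dim)] for i in range(n)]


# Hypothetical five-token snippet; indices refer to positions in `tokens`.
tokens = ["len", "=", "a", "+", "b"]
edges_by_type = {
    "sequential": [(0, 1), (1, 2), (2, 3), (3, 4)],
    "data_flow": [(2, 0)],      # placeholder: value of `a` reaches `len`
    "control_flow": [],         # none in this straight-line snippet
    "poacher_flow": [(0, 4)],   # placeholder long-range edge
}

adj = build_adjacency(len(tokens), edges_by_type)
norm_adj = normalize_with_self_loops(adj)

# Placeholder 2-dimensional node features standing in for RoBERTa embeddings.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5]]
updated = propagate(norm_adj, features)

for token, vec in zip(tokens, updated):
    print(token, [round(v, 3) for v in vec])
```

In a full GCN layer the propagated features would also be multiplied by a learned weight matrix and passed through a nonlinearity; both are omitted here to keep the focus on how the extra edge types change which nodes exchange information.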