Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation

Shangqing Liu, Wei Ma, Jian Wang, Xiaofei Xie, Ruitao Feng, Yang Liu

TL;DR

The paper addresses the challenge of detecting code vulnerabilities by moving beyond binary classification to fine-grained, type-specific detection. It introduces FGVulDet, which trains multiple type-specific classifiers on an edge-aware GGNN over Code Property Graphs, and uses vulnerability-preserving data augmentation to enrich scarce vulnerability-type data. A novel slicing-based augmentation and mutation framework preserves vulnerability semantics while expanding data diversity, and an edge-aware GGNN leverages edge types to better capture program semantics. Experiments on a large GitHub-derived dataset with five CWE types show improved recall and F1 over static and deep learning baselines, highlighting the method's potential to enhance practical vulnerability discovery and generalization in software security.

Abstract

Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks. Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary (0-1) classification task, that is, determining whether a piece of code is vulnerable or not. This makes it difficult for a single deep learning-based model to learn the wide array of vulnerability characteristics effectively. Furthermore, because large-scale vulnerability data is hard to collect, these detectors often overfit their limited training datasets, which lowers generalization performance. To address these challenges, we introduce a fine-grained vulnerability detector named FGVulDet. Unlike previous approaches, FGVulDet employs multiple classifiers to discern the characteristics of various vulnerability types and combines their outputs to identify the specific type of vulnerability. Each classifier is designed to learn type-specific vulnerability semantics. Additionally, to address the scarcity of data for some vulnerability types and to enhance data diversity for learning better vulnerability semantics, we propose a novel vulnerability-preserving data augmentation technique that increases the number of vulnerability samples. Taking inspiration from recent advances in graph neural networks for learning program semantics, we incorporate a Gated Graph Neural Network (GGNN) and extend it to an edge-aware GGNN to capture edge-type information. FGVulDet is trained on a large-scale dataset from GitHub encompassing five different types of vulnerabilities. Extensive experiments comparing FGVulDet with static-analysis-based and learning-based approaches demonstrate its effectiveness.
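The abstract says that FGVulDet combines the outputs of several type-specific classifiers to decide the vulnerability type, but it does not spell out the combination rule. The following is a minimal sketch of one plausible rule, assuming each classifier returns a probability that the sample carries its vulnerability type: the highest-scoring type is reported if it clears a threshold, and otherwise the sample is treated as non-vulnerable. The function name `combine_type_classifiers`, the threshold of 0.5, the type labels, and the keyword-based stand-in classifiers are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Optional

# Assumption: a "classifier" is any callable mapping a code sample to the
# probability that the sample contains one specific vulnerability type.
Classifier = Callable[[str], float]


def combine_type_classifiers(
    sample: str,
    classifiers: Dict[str, Classifier],
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> Optional[str]:
    """Return the predicted vulnerability type, or None if no type is predicted.

    Illustrative rule only: score the sample with every type-specific
    classifier, take the best-scoring type, and report it only if its
    score reaches the threshold.
    """
    if not classifiers:
        raise ValueError("at least one type-specific classifier is required")
    scores = {vuln_type: clf(sample) for vuln_type, clf in classifiers.items()}
    best_type = max(scores, key=lambda t: scores[t])
    return best_type if scores[best_type] >= threshold else None


# Keyword-based stand-ins for trained models (purely illustrative).
toy_classifiers: Dict[str, Classifier] = {
    "buffer_overflow": lambda code: 0.9 if "strcpy" in code else 0.1,
    "other_type": lambda code: 0.8 if "free(" in code else 0.2,
}

if __name__ == "__main__":
    print(combine_type_classifiers("strcpy(dst, src);", toy_classifiers))
    print(combine_type_classifiers("return 0;", toy_classifiers))
```

Other combination rules, such as voting or per-type thresholds, would fit the same interface; the sketch only shows how a set of binary, type-specific detectors can yield a fine-grained label.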

Paper Structure

This paper contains 36 sections, 9 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: An example to illustrate Code Property Graph.
  • Figure 2: The framework of FGVulDet.
  • Figure 3: Patch for Buffer Overflow.