Vulnerability Detection in C/C++ Code with Deep Learning

Zhen Huang, Amy Aumpansub

TL;DR

This work tackles automatic vulnerability detection in C/C++ code by representing code as program slices that capture syntactic and semantic cues related to vulnerabilities. It builds a large, balanced dataset by combining four vulnerability-relevant slice types, converts the slices into Word2Vec embeddings, and trains neural networks to classify each slice as vulnerable or non-vulnerable. The study shows that integrating multiple slice types and using a Bidirectional GRU with the ADAM optimizer yields the strongest performance, achieving about 92.5% test accuracy with high sensitivity and specificity. The work also provides open-source tooling to facilitate replication and further research.

Abstract

Deep learning has been shown to be a promising tool for detecting software vulnerabilities. In this work, we train neural networks on program slices extracted from the source code of C/C++ programs to detect software vulnerabilities. The program slices capture the syntactic and semantic characteristics of vulnerability-related program constructs, including API function calls, array usage, pointer usage, and arithmetic expressions. To obtain a model that predicts both vulnerable and non-vulnerable code well, we compare different types of training data, different optimizers, and different types of neural networks. Our results show that combining different types of source code characteristics and using a balanced number of vulnerable and non-vulnerable program slices yields balanced accuracy in predicting both vulnerable and non-vulnerable code. Among the neural networks compared, a bidirectional GRU (BGRU) with the ADAM optimizer performs best, detecting software vulnerabilities with an accuracy of 92.49%.
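The role of class balance mentioned in the abstract can be illustrated with a minimal sketch. It assumes a generic random down-sampling of the majority class and the standard definitions of sensitivity and specificity; the paper's actual down-sampling procedure is not described in this excerpt, and the function names `downsample_balanced` and `sensitivity_specificity` are hypothetical.

```python
import random


def downsample_balanced(slices, labels, seed=0):
    """Randomly down-sample the larger class so both classes have equal size.

    Assumption: label 1 marks a vulnerable slice, label 0 a non-vulnerable one.
    This is a generic illustration, not the authors' exact procedure.
    """
    rng = random.Random(seed)
    vulnerable = [s for s, y in zip(slices, labels) if y == 1]
    clean = [s for s, y in zip(slices, labels) if y == 0]
    n = min(len(vulnerable), len(clean))
    balanced = [(s, 1) for s in rng.sample(vulnerable, n)]
    balanced += [(s, 0) for s in rng.sample(clean, n)]
    rng.shuffle(balanced)
    return balanced


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Sensitivity = TP / (TP + FN): share of vulnerable slices that are detected.
    Specificity = TN / (TN + FP): share of non-vulnerable slices that are cleared.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Reporting sensitivity and specificity separately, rather than overall accuracy alone, shows whether a model favors one class. A classifier trained on heavily imbalanced data can reach high accuracy while missing most vulnerable slices, which is the situation a balanced training set is meant to avoid.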

Paper Structure

This paper contains 14 sections, 7 figures, and 16 tables.

Figures (7)

  • Figure 1: Generating Program Slices from Source Code.
  • Figure 2: Cosine similarity.
  • Figure 3: Visualized tokens in W2V model for each program slice type.
  • Figure 4: Down-sampling and vector adjustment.
  • Figure 5: Model fitting of BLSTM and LSTM.
  • ...and 2 more figures