Structure-Aware Code Vulnerability Analysis With Graph Neural Networks

Ravil Mussabayev

TL;DR

This work evaluates the general applicability of graph neural networks for code vulnerability detection by reproducing a ReVeal-like architecture on C++ and Java datasets derived from vulnerability-fixing commits. It systematically investigates graph representations, pruning, and data partitioning, revealing that pruning operator nodes and omitting certain fine-grained edges can improve detection performance, while random/synthetic training data (P3) is crucial for achieving strong results. The Java experiments show that distinguishing vulnerable from fixed code (T1) is substantially harder than separating near-vulnerable or random code (T2), and that including random data in training helps T2 but not necessarily T1. Overall, the findings provide practical guidance for configuring GNN-based vulnerability analysis and point to future directions in data augmentation and model design to tackle fine-grained code differences.

Abstract

This study explores the effectiveness of graph neural networks (GNNs) for vulnerability detection in software code, utilizing a real-world dataset of Java vulnerability-fixing commits. The dataset's structure, based on the number of modified methods in each commit, offers a natural partition that facilitates diverse investigative scenarios. The primary focus is to evaluate the general applicability of GNNs in identifying vulnerable code segments and distinguishing these from their fixed versions, as well as from random non-vulnerable code. Through a series of experiments, the research addresses key questions about the suitability of different configurations and subsets of data in enhancing the prediction accuracy of GNN models. Experiments indicate that certain model configurations, such as the pruning of specific graph elements and the exclusion of certain types of code representation, significantly improve performance. Additionally, the study highlights the importance of including random data in training to optimize the detection capabilities of GNNs.

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Architecture of the ReVeal model
  • Figure 2: ReVeal model trained on parts, tested on $P_1 \cup P_2 \cup P_3$.
  • Figure 3: ReVeal model trained on parts, tested on $P_1 \cup P_3$. This is a stricter test for task $T_2$ than the one in Figure 2.
  • Figure 4: ReVeal model trained on parts, tested on $P_1$. This is a strict test for task $T_1$.
  • Figure 5: ReVeal model trained on parts, tested on $P_1$ (all marked positive) $\cup$ $P_3$ (all marked negative). This is a strict test for task $T_2$.
  • ...and 2 more figures