A Research and Development Portfolio of GNN Centric Malware Detection, Explainability, and Dataset Curation

Hossein Shokouhinejad, Griffin Higgins, Roozbeh Razavi-Far, Ali A. Ghorbani

TL;DR

The paper addresses the core challenges of applying Graph Neural Networks (GNNs) to malware detection by presenting a cohesive portfolio of six interconnected studies that advance efficiency, interpretability, and reproducibility. It progresses from a foundational survey to graph reduction techniques (including Node-Centric Pruning with walks of fixed length $L$) and integrated pruning-learning frameworks, then to explanation consistency via stability-promoting methods and dual prototype-based explanations. An ensemble framework with attention-guided stacking combines diverse GNNs and provides ensemble-aware explanations, while curated dataset releases of control flow graphs (CFGs) and function call graphs (FCGs) extracted from PE files enable reproducible research. Together, these contributions establish a workflow that improves the scalability, transparency, and practical deployment of GNN-based malware detection and provide benchmarks for future work.

Abstract

Graph Neural Networks (GNNs) have become an effective tool for malware detection by capturing program execution through graph-structured representations. However, important challenges remain regarding scalability, interpretability, and the availability of reliable datasets. This paper brings together six related studies that collectively address these issues. The portfolio begins with a survey of graph-based malware detection and explainability, then advances to new graph reduction methods, integrated reduction-learning approaches, and investigations into the consistency of explanations. It also introduces dual explanation techniques based on subgraph matching and develops ensemble-based models with attention-guided stacked GNNs to improve interpretability. In parallel, curated datasets of control flow graphs are released to support reproducibility and enable future research. Together, these contributions form a coherent line of research that strengthens GNN-based malware detection by enhancing efficiency, increasing transparency, and providing solid experimental foundations.

Paper Structure

This paper contains 9 sections and 6 figures.

Figures (6)

  • Figure 1: Roadmap of graph-based malware detection [CIC3], showing the link between datasets, analysis, feature engineering, graph reduction, embedding, and explainability.
  • Figure 2: Overview of graph reduction techniques, including coarsening, condensation, and sparsification [CIC3].
  • Figure 3: Node feature embedding process, where raw assembly instructions from PE file control flow graphs are converted into fixed-length vectors and compressed into low-dimensional embeddings [CIC2].
  • Figure 4: Dual explanation framework for malware detection, combining a base GNN explainer with a subgraph matching module (SubMatch) to connect local explanations [Dual].
  • Figure 5: SubMatch explainer using subgraph matching to highlight relevant malicious (red) and benign (blue) regions within a target CFG [Dual].
  • ...and 1 more figures