Explainable Malware Detection through Integrated Graph Reduction and Learning Techniques

Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani

TL;DR

This work tackles malware detection using Control Flow Graphs (CFGs) and Function Call Graphs (FCGs) by integrating graph reduction and explainability into a GNN-based pipeline. It introduces Leaf Prune, Comp Prune, K-core, and Walk Index Sparsification to shrink large program graphs while preserving discriminative information, and uses two node embeddings, Function Name Embedding (FNE) and Assembly Embedding (AE), to feed a GCN classifier. The framework is augmented with GNNExplainer to provide interpretable subgraph explanations. The results indicate that leaf pruning often yields the best efficiency-accuracy trade-off and that AE generally outperforms FNE. The approach shows promise for scalable, transparent malware detection on real-world datasets such as BODMAS, Dike, and PMML, enabling faster analysis with meaningful explanations for security analysts.

Abstract

Control Flow Graphs and Function Call Graphs have become pivotal in providing a detailed understanding of program execution and effectively characterizing the behavior of malware. These graph-based representations, when combined with Graph Neural Networks (GNNs), have shown promise in developing high-performance malware detectors. However, challenges remain due to the large size of these graphs and the inherent opacity in the decision-making process of GNNs. This paper addresses these issues by developing several graph reduction techniques to reduce graph size and applying the state-of-the-art GNNExplainer to enhance the interpretability of GNN outputs. The analysis demonstrates that integrating the proposed graph reduction techniques along with GNNExplainer in the malware detection framework significantly reduces graph size while preserving high performance, providing an effective balance between efficiency and transparency in malware detection.
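
The summary names the reduction techniques but does not define them. As a rough illustration only, the sketch below shows what leaf pruning and k-core reduction could look like on a small call graph. It rests on several assumptions that are not taken from the paper: the graph is treated as undirected, a "leaf" is a node with exactly one neighbor, leaf pruning is a single pass, and the function names (`build_adjacency`, `leaf_prune`, `k_core`) and the example graph are hypothetical. The k-core routine follows the standard graph-theoretic definition, which may differ from the authors' variant.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from (caller, callee) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:  # ignore self-loops such as direct recursion
            adj[u].add(v)
            adj[v].add(u)
    return adj


def leaf_prune(adj):
    """Assumed definition: drop every node with exactly one neighbor, in one pass."""
    leaves = {n for n, nbrs in adj.items() if len(nbrs) == 1}
    return {n: nbrs - leaves for n, nbrs in adj.items() if n not in leaves}


def k_core(adj, k):
    """Standard k-core: repeatedly remove nodes with fewer than k neighbors."""
    core = {n: set(nbrs) for n, nbrs in adj.items()}  # work on a copy
    pending = [n for n, nbrs in core.items() if len(nbrs) < k]
    while pending:
        n = pending.pop()
        if n not in core:  # already removed via an earlier entry
            continue
        for m in core.pop(n):
            core[m].discard(n)
            if len(core[m]) < k:
                pending.append(m)
    return core


# Hypothetical five-function call graph used only for illustration.
edges = [("main", "parse"), ("main", "run"), ("run", "parse"),
         ("main", "log"), ("run", "helper")]
adj = build_adjacency(edges)
print(sorted(leaf_prune(adj)))  # ['main', 'parse', 'run']
print(sorted(k_core(adj, 2)))   # ['main', 'parse', 'run']
```

On this toy graph both reductions keep the same three-node cycle, but they are not equivalent in general: the single-pass leaf prune removes only the current leaves, while k-core keeps removing nodes until every remaining node has at least k neighbors.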

Paper Structure

This paper contains 22 sections, 7 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Proposed framework for malware detection including two modules for data collection and graph generation, two modules for pre-processing, i.e., node embedding and graph reduction, a decision-making module, and a post-analysis module named explainability.
  • Figure 2: Schematic diagram of assembly instruction embedding.
  • Figure 3: Graph classification model architecture.
  • Figure 4: Visualization of example malicious and benign samples. Each panel shows a generated graph through an embedding module with/without pruning. NP and LP stand for No Prune and Leaf Prune, respectively.
  • Figure 5: Comparison of the number of nodes, edges, and components of the FCG (top panel) and CFG (bottom panel) generated through each graph reduction technique, with no pruning as the baseline.
  • ...and 5 more figures