Revisiting Concept Drift in Windows Malware Detection: Adaptation to Real Drifted Malware with Minimal Samples

Adrian Shuai Li, Arun Iyengar, Ashish Kundu, Elisa Bertino

TL;DR

This work tackles concept drift in Windows malware detection when only a few drifted samples are available, by learning drift-invariant representations of control flow graphs (CFGs) with graph neural networks (GNNs) and adversarial domain adaptation. It introduces a three-component pipeline (CFG construction, vertex feature extraction with PalmTree embeddings, and shift adaptation with a domain-adversarial GNN) and a graph-clustering approach to generate meaningful drift benchmarks. Evaluations on Big-15 and MB-24 show that the proposed method outperforms cold-start and warm-start retraining baselines and existing domain adaptation (DA) approaches, maintaining high accuracy with very few labeled target samples and remaining robust under obfuscation and open-set conditions. The practical impact is a scalable, label-efficient framework for drift-aware malware detection that can adapt to evolving threats in real-world data, with released artifacts to facilitate replication.

Abstract

In applying deep learning for malware classification, it is crucial to account for the prevalence of malware evolution, which can cause trained classifiers to fail on drifted malware. Existing solutions to address concept drift use active learning. They select new samples for analysts to label and then retrain the classifier with the new labels. Our key finding is that the current retraining techniques do not achieve optimal results. These techniques overlook that updating the model with scarce drifted samples requires learning features that remain consistent across pre-drift and post-drift data. The model should thus be able to disregard specific features that, while beneficial for the classification of pre-drift data, are absent in post-drift data, thereby preventing prediction degradation. In this paper, we propose a new technique for detecting and classifying drifted malware that learns drift-invariant features in malware control flow graphs by leveraging graph neural networks with adversarial domain adaptation. We compare it with existing model retraining methods in active learning-based malware detection systems and other domain adaptation techniques from the vision domain. Our approach significantly improves drifted malware detection on publicly available benchmarks and real-world malware databases reported daily by security companies in 2024. We also tested our approach in predicting multiple malware families drifted over time. A thorough evaluation shows that our approach outperforms the state-of-the-art approaches.

Paper Structure (83 sections, 11 equations, 13 figures, 11 tables)

Figures (13)

  • Figure 1: Overview of our approach: we show the assembly code on the left and the corresponding control flow graph on the right.
  • Figure 2: The inputs are source and target graph data, each represented as a node attribute matrix ($X^{s/t}$) and an adjacency matrix ($A^{s/t}$). We obtain these two matrices for each graph after the Vertex Feature Extraction step in Figure \ref{data}. The training process is modeled as a minimax game between the generator and the discriminator. $h^s_G$ and $h^t_G$ denote the graph-level representations of the source and target inputs, respectively. After training, the discriminator can no longer tell the two domains apart based solely on $h^s_G$ and $h^t_G$. At the same time, these representations retain the information needed for good classification in both domains.
  • Figure 3: Given a fixed set of target training labels, we compute the accuracy on the target test data for different baseline techniques and our method. The left diagram reports the averaged accuracy based on the original label set of Big-15, and the right one reports results based on the cluster label assignment.
  • Figure 4: Visualization of the graph feature vectors with their labels. The left part of the figure shows the data with the original labels from Big-15 [ronen2018microsoft], and the right part shows the newly learned clusters. The legend represents the mapping between labels and colors.
  • Figure 5: Comparison of different representations under cold-start learning (left) and warm-start learning (right).
  • ...and 8 more figures
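
The Figure 2 caption describes graph-level representations $h_G$ computed from a node attribute matrix $X$ and an adjacency matrix $A$, and a discriminator that should end up unable to separate source from target. The following is a minimal, illustrative sketch of those two ideas in plain Python. It is not the paper's implementation: the function names (`mean_aggregate`, `graph_readout`, `domain_loss`), the single mean-aggregation layer with self-loops, the mean-pooling readout, and the binary cross-entropy domain loss are all assumptions chosen for brevity, and no learned weights are involved.

```python
import math
from typing import List

Matrix = List[List[float]]


def mean_aggregate(X: Matrix, A: Matrix) -> Matrix:
    """One round of message passing (assumed form, not the paper's layer):
    each node averages its own features with those of its neighbours."""
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        # Neighbours of node i according to A, plus a self-loop.
        nbrs = [j for j in range(n) if A[i][j] != 0 or j == i]
        out.append([sum(X[j][k] for j in nbrs) / len(nbrs) for k in range(d)])
    return out


def graph_readout(H: Matrix) -> List[float]:
    """Mean-pool node embeddings into a single graph-level vector h_G."""
    n = len(H)
    return [sum(row[k] for row in H) / n for k in range(len(H[0]))]


def domain_loss(p_source: List[float], p_target: List[float]) -> float:
    """Binary cross-entropy of a hypothetical discriminator that outputs
    P(domain = source) for each graph-level representation."""
    eps = 1e-12  # avoids log(0)
    losses = [-math.log(max(p, eps)) for p in p_source]
    losses += [-math.log(max(1.0 - p, eps)) for p in p_target]
    return sum(losses) / len(losses)


# Toy 3-node path graph 0 - 1 - 2 with 2-dimensional node attributes.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h_G = graph_readout(mean_aggregate(X, A))

# A discriminator that outputs 0.5 everywhere cannot tell the domains apart;
# its loss then equals ln 2.
confused = domain_loss([0.5, 0.5], [0.5])
```

In a minimax setup of this kind, the discriminator is trained to lower `domain_loss` while the feature generator is trained to raise it (alongside its classification loss), so a discriminator output near 0.5 for every graph corresponds to the "fails to discern the domain" outcome mentioned in the caption.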