MalMixer: Few-Shot Malware Classification with Retrieval-Augmented Semi-Supervised Learning

Jiliang Li, Yifan Zhang, Yu Huang, Kevin Leach

TL;DR

MalMixer tackles few-shot malware family classification under a rapid influx of new samples by integrating a retrieval-augmented data augmentation pipeline with a semi-supervised learning framework. It partitions static malware features into interpolatable and non-interpolatable sets, and uses domain-knowledge-guided retrieval and manifold alignment to synthesize plausible samples that reflect the ground-truth distributions. The approach yields state-of-the-art performance in few-shot scenarios on the BODMAS and MOTIF benchmarks, demonstrates robustness to temporal shifts, and shows practical retraining benefits when new families are encountered. The work suggests that domain-knowledge-aware augmentation, combined with semi-supervised learning, can substantially reduce manual reverse engineering while preserving classification accuracy in real-world malware defense tasks.

Abstract

Recent growth and proliferation of malware have tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a domain-knowledge-aware data augmentation technique for malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware data augmentation methods for malware features and shows the capabilities of similar semi-supervised classifiers in addressing malware classification issues.

Paper Structure

This paper contains 28 sections, 5 equations, 9 figures, 11 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overview of our proposed data augmentation approach for malware feature representations. We divide static malware features into interpolatable and non-interpolatable sets, and use retrieval and alignment techniques to generate synthetic samples.
  • Figure 2: Illustration of our model framework, which consists of (1) a retrieval-based augmentation pipeline that grows the number of training samples and (2) an overarching augmentation-based semi-supervised classification framework that bolsters the augmentation pipeline.
  • Figure 3: Diagram illustrating Domain Invariance Learning for malware. As shown in part (a), the process contains three components: malware reconstruction, similarity learning, and dissimilarity learning. The static analysis features of malware are categorized into interpolatable ($s_i$) and non-interpolatable ($s_n$) features. During malware reconstruction, an encoder-decoder architecture accepts $s_i$ and $s_n$ as inputs and reconstructs them into $s_i^\prime$ and $s_n^\prime$. Simultaneously, the architecture's hidden features are divided into two sections, each forced to learn invariant and dissimilar features, respectively. Part (b) visualizes the learned hidden features, projecting malware into dissimilar spaces and an intersecting invariance space.
  • Figure 4: Illustration of embedding projection for malware. Pretrained encoder models transform each malware feature representation into embeddings for N-Retrieval and I-Alignment purposes.
  • Figure 5: Illustration of the retrieval and alignment process for identifying the optimal non-interpolatable $\mathcal{N}$ features for synthetic malware. The diagram contains three components in part (a): malware mixing, N-retrieval, and I-alignment. In malware mixing, two similar malware samples are combined based on their $\mathcal{N}$ and $\mathcal{I}$ features. The mixed $\mathcal{N}$ features are used to search for the top-k similar $\mathcal{N}$ in the database to represent the mixed $\mathcal{N}$. The mixed $\mathcal{I}$ features are then used to align with candidate $\mathcal{N}$ features and select the best aligned $\mathcal{N}$. Part (b) visualizes the process of choosing a set of $\mathcal{N}$ features with the highest degree of alignment from all candidates in the invariance embedding.
  • ...and 4 more figures
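
The mixing, N-retrieval, and I-alignment steps described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `mix`, `synthesize`), the linear mixing rule with weight `lam`, the use of cosine similarity, retrieval in raw feature space, and the encoder callables `enc_i` and `enc_n` (stand-ins for pretrained encoders that project into a shared invariance space) are all hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mix(u, v, lam):
    """Linear interpolation of two vectors (an assumed mixing rule)."""
    return [lam * a + (1 - lam) * b for a, b in zip(u, v)]

def synthesize(i_a, n_a, i_b, n_b, db_n, enc_i, enc_n, lam=0.5, k=3):
    """Return (mixed I features, a real N vector taken from db_n, its index)."""
    # Malware mixing: the mixed N vector serves only as a search query.
    # It is never emitted, because N features are by definition not
    # meaningfully interpolatable.
    i_mix = mix(i_a, i_b, lam)
    n_query = mix(n_a, n_b, lam)

    # N-retrieval: indices of the k database entries most similar to the query.
    # (Assumption: similarity is computed on raw features, not embeddings.)
    ranked = sorted(range(len(db_n)),
                    key=lambda j: cosine(n_query, db_n[j]),
                    reverse=True)
    candidates = ranked[:k]

    # I-alignment: keep the candidate whose embedding agrees best with the
    # embedding of the mixed I features in the shared space.
    target = enc_i(i_mix)
    best = max(candidates, key=lambda j: cosine(target, enc_n(db_n[j])))
    return i_mix, db_n[best], best

# Toy usage: identity functions stand in for the pretrained encoders.
identity = lambda x: x
db_n = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
i_new, n_new, idx = synthesize(
    i_a=[1.0, 0.0, 0.0], n_a=db_n[0],
    i_b=[0.0, 1.0, 0.0], n_b=db_n[1],
    db_n=db_n, enc_i=identity, enc_n=identity, lam=0.75, k=2)
print(i_new, n_new, idx)
```

In this sketch the synthetic sample pairs interpolated I features with an N vector copied unchanged from a real sample, so the non-interpolatable part always corresponds to something that exists in the database.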