Catch'em all: Classification of Rare, Prominent, and Novel Malware Families

Maksim E. Eren, Ryan Barron, Manish Bhattarai, Selma Wanna, Nicholas Solovyev, Kim Rasmussen, Boian S. Alexandrov, Charles Nicholas

TL;DR

This work introduces MalwareDNA, a semi-supervised framework that builds a latent-signature archive via hierarchical non-negative matrix factorization with automatic model selection (NMFk) to enable real-time malware family classification and novel-threat detection under class imbalance. Inference projects new samples onto the signature archive using NNLS and evaluates confidence with Projection Similarity, Ensemble Voting, and Data Augmentation, enabling a reject-option for abstention and novel-family identification. Experiments on EMBER-2018 Windows PE data show that MalwareDNA, particularly with Ensemble Voting, achieves high F1 scores on frequent and rare families and strong novel-threat rejection, outperforming supervised baselines (XGBoost/LightGBM) and semi-supervised variants. The approach delivers robust, scalable malware analysis with controllable coverage, aiding risk assessment and rapid mitigation of emerging threats.
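
The inference step described above (NNLS projection onto the signature archive, a confidence score, and a reject option) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy archive `W`, the labels, the 0.9 threshold, and the function name `classify_with_reject` are all hypothetical, and cosine similarity between a sample and its reconstruction is used here only as an assumed stand-in for the paper's Projection Similarity metric.

```python
import numpy as np
from scipy.optimize import nnls


def classify_with_reject(W, labels, x, threshold=0.9):
    """Project x onto the columns of W with NNLS and abstain when confidence is low.

    W         : (n_features, n_signatures) non-negative signature archive (toy data here)
    labels    : family label attached to each signature (column of W)
    threshold : hypothetical confidence cut-off, chosen only for illustration
    """
    # Solve min_h ||W h - x||_2 subject to h >= 0.
    h, _ = nnls(W, x)
    recon = W @ h
    denom = np.linalg.norm(x) * np.linalg.norm(recon)
    # Assumed confidence: cosine similarity between the sample and its projection.
    confidence = float(x @ recon / denom) if denom > 0 else 0.0
    if confidence < threshold:
        return None, confidence  # reject option: abstain / flag as possibly novel
    return labels[int(np.argmax(h))], confidence


# Toy archive with two signatures, one per hypothetical family.
W = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
labels = ["family_a", "family_b"]

known = np.array([2.0, 2.0, 0.1, 0.0])  # well explained by the first signature
novel = np.array([1.0, 0.0, 0.0, 1.0])  # poorly explained by either signature

print(classify_with_reject(W, labels, known))
print(classify_with_reject(W, labels, novel))
```

Raising the threshold lowers coverage (more abstentions) in exchange for fewer confident mistakes, which is the coverage trade-off the summary refers to.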

Abstract

National security is threatened by malware, which remains one of the most dangerous and costly cyber threats. As of last year, researchers reported 1.3 billion known malware specimens, motivating the use of data-driven machine learning (ML) methods for analysis. However, shortcomings in existing ML approaches hinder their mass adoption. These challenges include detecting novel malware and classifying malware under class imbalance, a situation where malware families are not equally represented in the data. Our work addresses these shortcomings with MalwareDNA: an advanced dimensionality reduction and feature extraction framework. We demonstrate stable performance under class imbalance on two tasks, malware family classification and novel malware detection, at the cost of an increased abstention (reject-option) rate.

Paper Structure (14 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of how the archive of latent signatures is built hierarchically from multi-dimensional data. Patterns are first extracted from the data (S1). Each pattern corresponds to a cluster of samples (e.g. malware specimens, S2). If every sample in a cluster belongs to the same class (a uniform cluster), the patterns (or latent signatures) corresponding to that cluster are placed into the archive (S3). Otherwise, the mixed signatures of the samples in the non-uniform cluster are separated by successive factorization (going back to S1).
  • Figure 2: Risk-Coverage (RC) curve when classifying malware families and novel malware together with the area under the RC (AURC) for different MalwareDNA confidence metrics and our semi-supervised baselines.
  • Figure 3: Mean F1 scores with confidence intervals (CI) are reported for each malware family, comparing MalwareDNA under different confidence metrics against both our supervised and semi-supervised baselines. While the baselines' performance degrades for the rare malware families (emotet, fareit, and zusy), our method maintains its performance.
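
The risk-coverage curve and AURC named in the Figure 2 caption can be sketched as follows. This is a minimal illustration under one common convention, not the paper's evaluation code: samples are accepted in order of decreasing confidence, risk is the error rate among accepted samples, and AURC is taken as the mean risk over all coverage levels. The function names and the toy scores are hypothetical.

```python
import numpy as np


def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) arrays, accepting the most confident samples first.

    confidence : per-sample confidence scores (higher means more confident)
    correct    : 1 if the prediction for that sample was right, else 0
    """
    confidence = np.asarray(confidence, dtype=float)
    errors = 1.0 - np.asarray(correct, dtype=float)
    order = np.argsort(-confidence, kind="stable")  # descending confidence
    accepted = np.arange(1, len(order) + 1)         # number of samples accepted so far
    coverage = accepted / len(order)                # fraction of samples not rejected
    risk = np.cumsum(errors[order]) / accepted      # error rate among accepted samples
    return coverage, risk


def aurc(confidence, correct):
    """Area under the RC curve, approximated as the mean risk across coverage levels."""
    _, risk = risk_coverage_curve(confidence, correct)
    return float(risk.mean())


# Toy scores, for illustration only.
scores = [0.9, 0.8, 0.6, 0.3]
hits = [1, 1, 0, 1]
coverage, risk = risk_coverage_curve(scores, hits)
print(coverage, risk, aurc(scores, hits))
```

A lower AURC means errors are concentrated among low-confidence predictions, so a confidence metric with lower AURC makes rejection more useful.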