Catch'em all: Classification of Rare, Prominent, and Novel Malware Families
Maksim E. Eren, Ryan Barron, Manish Bhattarai, Selma Wanna, Nicholas Solovyev, Kim Rasmussen, Boian S. Alexandrov, Charles Nicholas
TL;DR
This work introduces MalwareDNA, a semi-supervised framework that builds a latent-signature archive via hierarchical non-negative matrix factorization with automatic model selection (NMFk) to enable real-time malware family classification and novel-threat detection under class imbalance. At inference time, new samples are projected onto the signature archive using non-negative least squares (NNLS), and confidence is evaluated with Projection Similarity, Ensemble Voting, and Data Augmentation, which provides a reject option for abstention and novel-family identification. Experiments on EMBER-2018 Windows PE data show that MalwareDNA, particularly with Ensemble Voting, achieves high F1 scores on frequent and rare families and strong novel-threat rejection, outperforming supervised baselines (XGBoost/LightGBM) and semi-supervised variants. The approach delivers robust, scalable malware analysis with controllable coverage, aiding risk assessment and rapid mitigation of emerging threats.
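The projection-and-reject idea summarized above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy archive, the cosine-similarity confidence score, and the 0.9 threshold are all assumptions made for illustration. The NNLS step is approximated here with plain projected gradient descent so the example needs only the standard library; a practical version would more likely call a library solver such as `scipy.optimize.nnls`. Ensemble Voting and Data Augmentation are not covered.

```python
import math


def reconstruct(signatures, h):
    """Return W h, where each entry of `signatures` is one column of W."""
    dim = len(signatures[0])
    return [sum(h[i] * signatures[i][j] for i in range(len(signatures)))
            for j in range(dim)]


def nnls_projected_gradient(signatures, x, n_iter=500):
    """Approximately solve min_{h >= 0} 0.5 * ||W h - x||^2.

    Assumes a non-empty archive with at least one non-zero entry.
    """
    # 1 / ||W||_F^2 never exceeds 1 / lambda_max(W^T W), so the step is safe.
    step = 1.0 / sum(v * v for s in signatures for v in s)
    h = [0.0] * len(signatures)
    for _ in range(n_iter):
        recon = reconstruct(signatures, h)
        residual = [r - t for r, t in zip(recon, x)]
        for i, sig in enumerate(signatures):
            grad_i = sum(s * r for s, r in zip(sig, residual))
            # Gradient step followed by projection onto h >= 0.
            h[i] = max(0.0, h[i] - step * grad_i)
    return h


def cosine(u, v):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def classify_with_reject(signatures, labels, x, threshold=0.9):
    """Return (label, similarity); label is None when the model abstains."""
    h = nnls_projected_gradient(signatures, x)
    similarity = cosine(x, reconstruct(signatures, h))
    if similarity < threshold:
        # Poorly explained by the archive: abstain, possibly a novel family.
        return None, similarity
    best = max(range(len(h)), key=lambda i: h[i])
    return labels[best], similarity


# Hypothetical two-signature archive over four non-negative features.
archive = [[1.0, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.5]]
families = ["family_A", "family_B"]

known_like = [2.0, 1.0, 0.1, 0.0]   # close to a scaled family_A signature
unfamiliar = [0.0, 1.0, 0.0, 1.0]   # not well explained by either signature

label_known, sim_known = classify_with_reject(archive, families, known_like)
label_novel, sim_novel = classify_with_reject(archive, families, unfamiliar)
print(label_known, round(sim_known, 3))
print(label_novel, round(sim_novel, 3))
```

In this toy setting the first sample is reconstructed almost exactly and is assigned to `family_A`, while the second is reconstructed poorly and triggers the reject option. Raising the threshold trades coverage for fewer confident mistakes, which is the controllable-coverage behavior the summary refers to.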
Abstract
Malware remains one of the most dangerous and costly cyber threats, and it poses a risk to national security. As of last year, researchers reported 1.3 billion known malware specimens, motivating the use of data-driven machine learning (ML) methods for analysis. However, shortcomings in existing ML approaches hinder their mass adoption. These shortcomings include difficulty detecting novel malware and difficulty classifying malware under class imbalance, a situation where malware families are not equally represented in the data. Our work addresses these shortcomings with MalwareDNA, an advanced dimensionality reduction and feature extraction framework. We demonstrate stable performance under class imbalance on two tasks, malware family classification and novel malware detection, at the cost of an increased abstention (reject-option) rate.
