Microsoft Malware Classification Challenge
Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, Mansour Ahmadi
TL;DR
The paper surveys the Microsoft Malware Classification Challenge dataset, which was created to address the enormous data volume and polymorphism of malware by enabling family-level classification. It documents the dataset's scale, structure, and labeling, including the raw hex content and IDA-derived metadata, and analyzes how the community has cited and used the dataset. The authors divide the citing literature into two groups: broad discussions of the relevance of machine learning, and empirical studies that use the dataset to investigate scalability, feature engineering, robustness, drift detection, and deep learning. Ultimately, the work reinforces the dataset's role as a standard benchmark and calls for ongoing updates and citations as new research emerges.
Abstract
The Microsoft Malware Classification Challenge was announced in 2015 together with the publication of a large dataset of nearly 0.5 terabytes, consisting of the disassembly and bytecode of more than 20K malware samples. Apart from its use in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, it has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and supports future performance evaluation on the dataset.
