Microsoft Malware Classification Challenge

Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, Mansour Ahmadi

TL;DR

The paper surveys the Microsoft Malware Classification Challenge dataset, created to address the enormous data volume and polymorphism in malware by enabling family-level classification. It documents the dataset's scale, structure, and labeling, including raw hex content and IDA-derived metadata, and analyzes how the community has cited and used the dataset. The authors categorize the literature into broad discussions of ML relevance and empirical studies that leverage the dataset for scalability, feature engineering, robustness, drift detection, and deep learning, highlighting the field's diverse approaches. Ultimately, the work reinforces the dataset's role as a standard benchmark and calls for ongoing updates and citations as new research emerges.

Abstract

The Microsoft Malware Classification Challenge was announced in 2015 along with the publication of a huge dataset of nearly 0.5 terabytes, consisting of the disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and supports future performance evaluation on the dataset.

Paper Structure

This paper contains 4 sections and 2 tables.