Clustering Malware at Scale: A First Full-Benchmark Study
Martin Mocko, Jakub Ševcech, Daniela Chudá
TL;DR
This paper addresses the lack of large-scale, inclusive malware clustering studies by benchmarking clustering on the full public datasets Bodmas and Ember (and a private Security dataset), with benign samples included. It compares multiple representations (PCA, Autoencoder, UMAP) and clustering algorithms (K-Means, DBSCAN, HAC, BIRCH), finding that K-Means and BIRCH generally outperform DBSCAN and HAC, and that clustering quality depends heavily on dataset composition. The study shows that including benign samples does not significantly degrade Homogeneity, and it gives guidance on the choice of representation (PCA is often best, UMAP suits smaller datasets, Autoencoder suits larger ones) and on cluster count. An ablation analysis of the representation suggests limited gains from increasing the number of components but potential benefits from using more clusters. The paper also discusses practical limitations and future research directions in representation learning and broader algorithm coverage.
Abstract
Recent years have shown that malware attacks still happen with high frequency. Malware experts seek to categorize and classify incoming samples to confirm their trustworthiness or prove their maliciousness. One way to identify groups of malware samples is malware clustering. Despite the efforts of the community, malware clustering that incorporates benign samples has been under-explored. Moreover, although larger public benchmark malware datasets are available, malware clustering studies have avoided using them in full, often resorting to small datasets with only a few families. As a result, the current state of the art in malware clustering remains unclear. In our study, we evaluate malware clustering quality and establish the state of the art on Bodmas and Ember, two large public benchmark malware datasets. Ours is the first study of malware clustering performed on whole malware benchmark datasets. Additionally, we extend the malware clustering task by incorporating benign samples. Our results indicate that incorporating benign samples does not significantly degrade clustering quality. We find that the quality of the created clusters differs between Ember, Bodmas, and a private industry dataset. Contrary to popular opinion, our top clustering performers are K-Means and BIRCH, with DBSCAN and HAC falling behind.
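The TL;DR reports clustering quality in terms of Homogeneity. As a minimal sketch, assuming the standard entropy-based definition h = 1 - H(C|K)/H(C) (where C are the ground-truth labels and K the cluster assignments), the measure can be computed as follows. The function names and the toy labels (`family_a`, `family_b`, `benign`) are illustrative assumptions and do not come from the study; the study's own evaluation code may differ.

```python
from collections import Counter
from math import log


def entropy(counts, total):
    """Shannon entropy (natural log) of a distribution given as raw counts."""
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def homogeneity(true_labels, cluster_ids):
    """Homogeneity h = 1 - H(C|K) / H(C); 1.0 means every cluster holds one label only."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    n = len(true_labels)
    if n == 0:
        return 1.0  # convention assumed here for empty input

    h_c = entropy(Counter(true_labels).values(), n)
    if h_c == 0.0:
        return 1.0  # a single ground-truth label is trivially homogeneous

    cluster_sizes = Counter(cluster_ids)
    joint = Counter(zip(true_labels, cluster_ids))
    # Conditional entropy of the labels given the cluster assignment.
    h_c_given_k = -sum(
        (n_ck / n) * log(n_ck / cluster_sizes[k])
        for (_, k), n_ck in joint.items()
    )
    return 1.0 - h_c_given_k / h_c


# Hypothetical toy labels: two malware families plus benign samples.
labels = ["family_a", "family_a", "family_b", "family_b", "benign", "benign"]

pure = homogeneity(labels, [0, 0, 1, 1, 2, 2])         # one label per cluster
merged = homogeneity(labels, [0, 0, 0, 0, 0, 0])       # everything in one cluster
singletons = homogeneity(labels, [0, 1, 2, 3, 4, 5])   # every sample alone
```

Under this definition, benign samples count as one more label, so a clustering that keeps them in their own clusters loses no Homogeneity. Splitting clusters further can never lower the score, which is worth keeping in mind when interpreting results across different cluster counts.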
