Online Clustering of Known and Emerging Malware Families

Olha Jurečková, Martin Jureček, Mark Stamp

TL;DR

The paper addresses the challenge of online clustering for rapidly expanding malware collections by proposing a model that partitions streaming samples into known malware families or emergent families using a clustering decision rule with parameter $\tau$. It combines WKNN-based mapping of streaming data to known clusters with online clustering (OKM, SOM, BSAS) for new families, enabling incremental updates without full reprocessing. Empirical results on the EMBER static-PE feature set show high cluster purity (up to $93.34\%$) and strong silhouette scores (up to $0.99$), with practical runtimes capable of handling daily influxes of hundreds of thousands of samples. The approach offers a practical path toward faster malware analysis and evolution-aware defense, with potential extensions via semi-supervised learning and broader family coverage.

Abstract

Malware attacks have become significantly more frequent and sophisticated in recent years. Therefore, malware detection and classification are critical components of information security. Given the large number of malware samples available, it is essential to categorize them according to their malicious characteristics. Clustering algorithms are thus becoming more widely used in computer security to analyze the behavior of malware variants and discover new malware families. Online clustering algorithms help analysts understand malware behavior and respond more quickly to new threats. This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families. Streaming data is divided by a clustering decision rule into samples from known malware families and samples from new, emerging families. Samples from known families are classified with the weighted k-nearest neighbor classifier, while the online k-means algorithm clusters the remaining streaming data, achieving a cluster purity from 90.20% for four clusters to 93.34% for ten clusters. This work is based on static analysis of portable executable files for the Windows operating system. Experimental results indicate that the proposed online clustering model can create high-purity clusters corresponding to malware families. Malware analysts can therefore receive similar samples grouped together, which speeds up their analysis.
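The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The exact form of the decision rule is not given in this summary, so the sketch assumes a hypothetical rule: a sample is treated as belonging to a known family only if its mean distance to its $k$ nearest labeled samples is at most $\tau$. The names `route_sample` and `OnlineKMeans`, the inverse-distance weighting, and the choice to seed centroids with the first samples are likewise assumptions made for illustration.

```python
import numpy as np

def route_sample(x, known_X, known_y, tau, k=3):
    """Return a known-family label, or None if x looks like a new family."""
    dists = np.linalg.norm(known_X - x, axis=1)
    nn = np.argsort(dists)[:k]  # indices of the k nearest labeled samples
    # ASSUMED decision rule: reject when the mean k-NN distance exceeds tau.
    if dists[nn].mean() > tau:
        return None
    # Weighted k-NN vote: closer neighbors get larger (inverse-distance) weights.
    votes = {}
    for i in nn:
        votes[known_y[i]] = votes.get(known_y[i], 0.0) + 1.0 / (dists[i] + 1e-12)
    return max(votes, key=votes.get)

class OnlineKMeans:
    """Sequential k-means: each sample updates only its nearest centroid."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centroids = []
        self.counts = []

    def partial_fit(self, x):
        x = np.asarray(x, dtype=float)
        # ASSUMPTION: the first n_clusters samples seed the centroids.
        if len(self.centroids) < self.n_clusters:
            self.centroids.append(x.copy())
            self.counts.append(1)
            return len(self.centroids) - 1
        j = int(np.argmin([np.linalg.norm(c - x) for c in self.centroids]))
        self.counts[j] += 1
        # Running-mean update of the winning centroid.
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

# Toy data: two known families in 2-D, then a short stream of new samples.
known_X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
known_y = ["famA", "famA", "famA", "famB", "famB", "famB"]
stream = np.array([[0.05, 0.05], [5.05, 5.0],
                   [20.0, 20.0], [-15.0, 12.0], [20.1, 19.9]])

okm = OnlineKMeans(n_clusters=2)
results = []
for x in stream:
    label = route_sample(x, known_X, known_y, tau=1.0)
    if label is None:  # not close to any known family: cluster it online
        label = f"new-{okm.partial_fit(x)}"
    results.append(label)

print(results)  # ['famA', 'famB', 'new-0', 'new-1', 'new-0']
```

Because each streaming sample touches only the labeled neighbors and one centroid, the state can be updated incrementally without reprocessing earlier samples, which is the property the abstract emphasizes.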

Paper Structure

This paper contains 18 sections, 10 equations, 8 figures, 2 tables, 3 algorithms.

Figures (8)

  • Figure 1: The architecture of the proposed model for the online clustering of malicious samples to malware families.
  • Figure 2: Demonstration of the decision rule used to determine whether the sample $x_{t+1}$ remains in the nearest cluster $C_1 \subset D$, which corresponds to a known malware family, and whether the sample $x_{t+2}$ is assigned to a cluster corresponding to a new malware family. The three nearest neighbors of the sample $x_{t+1}$ are circled.
  • Figure 3: The relationship between the number of features and the silhouette coefficient.
  • Figure 4: The relationship between the parameter $\tau$ and the percentage of streaming data clustered to new malware families.
  • Figure 5: The relationship between the number of clusters and (a) the purity of clusters and (b) the average silhouette coefficient. The results correspond to samples that were clustered to new malware families.
  • ...and 3 more figures
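
The cluster purity reported in the abstract and plotted in Figure 5 is commonly defined as the fraction of samples that carry the majority family label of their cluster. The sketch below assumes that standard definition; the function name `cluster_purity` and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter

def cluster_purity(cluster_ids, family_labels):
    """Fraction of samples whose family is the majority family of their cluster."""
    if len(cluster_ids) != len(family_labels):
        raise ValueError("cluster_ids and family_labels must have equal length")
    if len(cluster_ids) == 0:
        raise ValueError("purity is undefined for an empty sample set")
    # Count family labels separately inside each cluster.
    per_cluster = {}
    for cluster, family in zip(cluster_ids, family_labels):
        per_cluster.setdefault(cluster, Counter())[family] += 1
    # Sum the size of the majority family over all clusters.
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_ids)

# Toy example: 10 samples in 3 clusters, with majority counts 2, 3, and 3.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
families = ["a", "a", "b", "b", "b", "b", "c", "c", "c", "c"]
purity = cluster_purity(clusters, families)
print(purity)  # 0.8
```

Purity tends to rise as the number of clusters grows, since smaller clusters are easier to keep homogeneous, so it is usually read alongside a second measure such as the silhouette coefficient.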