Incremental Gaussian Mixture Clustering for Data Streams
Aniket Bhanderi, Raj Bhatnagar
TL;DR
This work tackles real-time clustering and anomaly detection in data streams by maintaining a large set of Gaussian cluster signatures with full covariance in a sketch memory. Clusters and anomalies are updated chunk-by-chunk using an entropy-based merging strategy and a compression module, enabling the final sketch to summarize the stream with a user-specified granularity while preserving clustering fidelity comparable to offline Gaussian mixtures. The approach supports anomaly scoring via Mahalanobis distances and tracks concept drift through time-stamped cluster updates, providing rich, temporal insights into evolving data. Across synthetic and two public 2-D datasets, the method achieves Rand indices near those of batch GMMs, demonstrating practical viability for efficient, structure-preserving streaming clustering and anomaly detection.
Abstract
Analyzing very large data streams is an important problem in many application domains. In this paper we present an algorithm for finding clusters and anomalous data points in a streaming dataset, and demonstrate that it works effectively. Entropy minimization is used as the criterion for defining and updating the clusters formed from the stream. As clusters are formed, we also identify anomalous data points that lie far from all known clusters. Using a number of 2-D datasets, we demonstrate the algorithm's effectiveness at discovering clusters and identifying anomalous data points.
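To make the two quantities named above concrete, the following is a minimal sketch, not the paper's implementation. It assumes each cluster signature is a (mean, full covariance) pair, scores a point by its smallest Mahalanobis distance to any known cluster, and computes the differential entropy of a Gaussian as the quantity an entropy-based criterion would compare. The function names, the example clusters, and the threshold of 3.0 are illustrative assumptions.

```python
import numpy as np


def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a Gaussian (mean, cov)."""
    diff = np.asarray(x, dtype=float) - mean
    # Solve cov @ y = diff instead of forming the explicit inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def gaussian_entropy(cov):
    """Differential entropy of a d-dimensional Gaussian with covariance cov."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)


def anomaly_score(x, clusters):
    """Smallest Mahalanobis distance from x to any cluster signature."""
    return min(mahalanobis(x, mean, cov) for mean, cov in clusters)


# Hypothetical 2-D cluster signatures: (mean, full covariance).
clusters = [
    (np.array([0.0, 0.0]), np.eye(2)),
    (np.array([10.0, 10.0]), np.array([[4.0, 0.0], [0.0, 1.0]])),
]

THRESHOLD = 3.0  # assumed cut-off, not a value taken from the paper

for point in ([0.5, -0.5], [5.0, 5.0]):
    score = anomaly_score(point, clusters)
    label = "anomaly" if score > THRESHOLD else "inlier"
    print(point, round(score, 3), label)
```

In this sketch a point is flagged only when it is far from every cluster in that cluster's own covariance metric, so an elongated cluster tolerates larger deviations along its long axis. A lower `gaussian_entropy` corresponds to a tighter cluster.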
