Incremental Gaussian Mixture Clustering for Data Streams

Aniket Bhanderi, Raj Bhatnagar

TL;DR

This work tackles real-time clustering and anomaly detection in data streams by maintaining a large set of Gaussian cluster signatures with full covariance in a sketch memory. Clusters and anomalies are updated chunk-by-chunk using an entropy-based merging strategy and a compression module, enabling the final sketch to summarize the stream with a user-specified granularity while preserving clustering fidelity comparable to offline Gaussian mixtures. The approach supports anomaly scoring via Mahalanobis distances and tracks concept drift through time-stamped cluster updates, providing rich, temporal insights into evolving data. Across synthetic and two public 2-D datasets, the method achieves Rand indices near those of batch GMMs, demonstrating practical viability for efficient, structure-preserving streaming clustering and anomaly detection.
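The summary above mentions Gaussian cluster signatures with full covariance, Mahalanobis-distance anomaly scoring, and entropy-based merging. A minimal sketch of those building blocks follows. It assumes a signature is stored as a `(count, mean, covariance)` tuple and that two signatures are combined by moment matching. The function names and the tuple layout are hypothetical, and the paper's actual merging rule and thresholds are not reproduced here.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of point x from a Gaussian with (mean, cov)."""
    diff = np.asarray(x, dtype=float) - mean
    # Solve cov @ z = diff rather than forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def anomaly_score(x, signatures):
    """Smallest Mahalanobis distance from x to any cluster signature.

    Assumption: a point is more anomalous the farther it lies from
    its nearest cluster; the cutoff to apply is left to the caller.
    """
    return min(mahalanobis(x, mean, cov) for _, mean, cov in signatures)

def merge_signatures(a, b):
    """Moment-matched merge of two (count, mean, cov) signatures."""
    n_a, mu_a, cov_a = a
    n_b, mu_b, cov_b = b
    n = n_a + n_b
    mu = (n_a * mu_a + n_b * mu_b) / n
    d_a = (mu_a - mu)[:, None]
    d_b = (mu_b - mu)[:, None]
    # Pooled covariance plus the spread contributed by the two means.
    cov = (n_a * (cov_a + d_a @ d_a.T) + n_b * (cov_b + d_b @ d_b.T)) / n
    return n, mu, cov

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance cov."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

# Two unit-covariance clusters in 2-D, centred at (0, 0) and (2, 0).
a = (1, np.array([0.0, 0.0]), np.eye(2))
b = (1, np.array([2.0, 0.0]), np.eye(2))
merged = merge_signatures(a, b)
score = anomaly_score([3.0, 0.0], [a])
```

One way to use `gaussian_entropy` is to compare the entropy of a merged signature against the entropies of its two parts and merge only when the increase is small. That comparison is an illustration of the general idea, not the criterion stated in the paper.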

Abstract

Analyzing data streams of very large volume is an important problem in many application domains. In this paper we present an algorithm that finds clusters and anomalous data points in a streaming dataset, and we demonstrate that it works effectively. Entropy minimization is used as the criterion for defining and updating the clusters formed from the stream. As the clusters are formed, we also identify anomalous data points that lie far away from all known clusters. Using a number of 2-D datasets, we demonstrate the method's effectiveness at discovering clusters and identifying anomalous data points.

Paper Structure

This paper contains 20 sections, 5 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: Incoming $k^{th}$ data chunk in Synthetic data
  • Figure 2: Signature of base clusters when $k^{th}$ data chunk arrived
  • Figure 3: Signature of base clusters after $k^{th}$ data chunk merged with the base cluster
  • Figure 4: Synthetic data with 10 clusters (left) and 7 clusters (right)
  • Figure 5: Concept drift in Synthetic dataset
  • ...and 4 more figures