Sequential Spectral Clustering of Data Sequences
G Dhinesh Chandran, Kota Srinivas Reddy, Srikrishna Bhashyam
TL;DR
The paper addresses non-parametric clustering of data sequences drawn from unknown distributions by extending spectral clustering to a sequential setting. It introduces SEQ-SPEC and IA-SEQ-SPEC, which estimate pairwise distances between distributions using the maximum mean discrepancy (MMD), build an affinity matrix and spectral embedding from those estimates, and apply a time-dependent stopping rule based on arcsin(C/√t) to recover the correct clusters from as few samples as possible for a given error probability. The authors show that SEQ-SPEC stops in finite time almost surely and is exponentially consistent. IA-SEQ-SPEC is an incremental approximation that reduces computation, and the memory-efficient variants SEQ-SPEC-B and IA-SEQ-SPEC-B store only the most recent B samples instead of all past samples. In simulations on synthetic and real datasets, the sequential methods outperform fixed-sample-size SPEC and the sequential baselines SEQ-KMED and SEQ-SLINK, while the computationally efficient and memory-efficient variants perform comparably to SEQ-SPEC. Overall, the work provides scalable, theoretically grounded tools for clustering data sequences, with potential applicability to evolving graphs and large-scale data sequences.
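To make the MMD-based distance estimate concrete, here is a minimal sketch of an empirical squared MMD between two sample sequences. The Gaussian kernel, the `bandwidth` parameter, the biased (V-statistic) form, and the function names are assumptions chosen for illustration; the summary does not state which kernel or estimator the paper uses.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix between two sample sets (assumed kernel)."""
    # Treat 1-D inputs as n samples of dimension 1.
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Squared Euclidean distances via the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy
```

Applying `mmd_squared` to every pair of sequences would give a pairwise distance matrix of the kind a spectral clustering step could consume; in a sequential setting, the estimate would be recomputed or updated as new samples arrive.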
Abstract
We study the problem of non-parametric clustering of data sequences, where each data sequence comprises independent and identically distributed (i.i.d.) samples generated from an unknown distribution. The true clusters are the clusters obtained using the Spectral clustering algorithm (SPEC) on the pairwise distance between the true distributions corresponding to the data sequences. Since the true distributions are unknown, the objective is to estimate the clusters by observing the minimum number of samples from the data sequences, given a specified error probability. To solve this problem, we propose the Sequential Spectral clustering algorithm (SEQ-SPEC), and show that it stops in finite time almost surely and is exponentially consistent. We also propose a computationally more efficient algorithm called the Incremental Approximate Sequential Spectral clustering algorithm (IA-SEQ-SPEC). Through simulations, we show that both SEQ-SPEC and IA-SEQ-SPEC perform better than the fixed sample size SPEC, the Sequential $K$-Medoids clustering algorithm (SEQ-KMED), and the Sequential Single Linkage clustering algorithm (SEQ-SLINK). In addition, we propose memory-efficient versions, SEQ-SPEC-B and IA-SEQ-SPEC-B. Unlike other related sequential clustering algorithms, which require storing all past samples, these algorithms require storing only the most recent $B$ samples. Both the computationally efficient and memory-efficient versions of SEQ-SPEC perform comparably to SEQ-SPEC in simulations.
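The abstract defines the true clusters as the output of SPEC on the pairwise distances between distributions. As a rough illustration, the sketch below runs one common form of normalized spectral clustering on a distance matrix. The Gaussian affinity, the `sigma` parameter, the row normalization, the simple k-means with farthest-point initialization, and all function names are assumptions made here; the paper's exact SPEC variant may differ.

```python
import numpy as np


def spectral_embedding(dist, num_clusters, sigma=1.0):
    """Embed items using the top eigenvectors of a normalized affinity matrix."""
    dist = np.asarray(dist, dtype=float)
    # Assumed affinity: Gaussian function of the pairwise distance.
    affinity = np.exp(-(dist**2) / (2.0 * sigma**2))
    np.fill_diagonal(affinity, 0.0)
    degree = affinity.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    normalized = affinity * inv_sqrt[:, None] * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    _, eigvecs = np.linalg.eigh(normalized)
    top = eigvecs[:, -num_clusters:]
    # Normalize each row to unit length before clustering.
    return top / np.linalg.norm(top, axis=1, keepdims=True)


def kmeans(points, num_clusters, num_iters=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, num_clusters):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(num_iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(num_clusters):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    # Final assignment against the last set of centers.
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def spectral_clusters(dist, num_clusters, sigma=1.0):
    """Cluster items given a symmetric matrix of pairwise distances."""
    embedding = spectral_embedding(dist, num_clusters, sigma)
    return kmeans(embedding, num_clusters)
```

With the true distances unknown, a sequential procedure would call something like `spectral_clusters` on estimated distances and decide after each new batch of samples whether to stop; that stopping test is specific to the paper and is not reproduced here.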
