Fast, Space-Optimal Streaming Algorithms for Clustering and Subspace Embeddings
Vincent Cohen-Addad, Liudeng Wang, David P. Woodruff, Samson Zhou
TL;DR
This work resolves a central question in data streams by showing that two core tasks, $(k,z)$-clustering and $L_p$ subspace embeddings, can be performed in one pass with space and time comparable to offline algorithms. The authors develop a unified framework that combines crude-and-refined sampling, merge-and-reduce, and an efficient global encoding to produce $(1+\varepsilon)$-strong coresets whose size is independent of the input size $n$, while achieving amortized update times of $\tilde{O}(d\log(k))$ for clustering and $O(d)$ for subspace embeddings. For clustering, they prove a close connection between online clustering sensitivity and $(k,z)$-medoids sensitivity, which enables efficient, sublinear-time streaming algorithms that match the offline coreset bounds up to polylogarithmic factors. For subspace embeddings, the methods yield space-optimal streaming constructions with tight dependence on the dimension $d$ and the exponent $p$, showing that streaming incurs no inherent overhead relative to the offline setting for these tasks. Together, these results bridge the gap between the streaming and offline models, enabling scalable, real-time data analysis with strong theoretical guarantees and practical encoding schemes.
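To make the merge-and-reduce ingredient concrete, here is a minimal Python sketch of the generic technique, not the authors' algorithm. The class `MergeAndReduce`, the helper `reduce_uniform`, and the parameter `block_size` are hypothetical names introduced for illustration. The reduce step is a weighted uniform-sampling placeholder that only preserves total weight; it does not give a $(1+\varepsilon)$-strong coreset, and a real construction such as sensitivity sampling would take its place.

```python
import random


def reduce_uniform(points, weights, m, rng):
    """Placeholder reduce step (assumption, not the paper's construction).

    Samples m points with probability proportional to weight and gives each
    sample weight total/m, so the total weight is preserved.
    """
    if len(points) <= m:
        return list(points), list(weights)
    total = sum(weights)
    idx = rng.choices(range(len(points)), weights=weights, k=m)
    return [points[i] for i in idx], [total / m] * m


class MergeAndReduce:
    """Keeps at most one weighted summary of size block_size per level."""

    def __init__(self, block_size, seed=0):
        self.m = block_size
        self.rng = random.Random(seed)
        self.buffer = []   # raw points that do not yet fill a block
        self.levels = {}   # level -> (points, weights)

    def insert(self, point):
        self.buffer.append(point)
        if len(self.buffer) == self.m:
            self._carry(self.buffer, [1.0] * self.m, 0)
            self.buffer = []

    def _carry(self, pts, wts, level):
        # Works like binary addition: two summaries on the same level are
        # merged (union), reduced back to m points, and pushed one level up.
        while level in self.levels:
            other_pts, other_wts = self.levels.pop(level)
            pts, wts = reduce_uniform(other_pts + pts, other_wts + wts,
                                      self.m, self.rng)
            level += 1
        self.levels[level] = (pts, wts)

    def summary(self):
        """Union of the buffer and all per-level summaries."""
        pts = list(self.buffer)
        wts = [1.0] * len(self.buffer)
        for level_pts, level_wts in self.levels.values():
            pts += level_pts
            wts += level_wts
        return pts, wts


mr = MergeAndReduce(block_size=16, seed=1)
for i in range(1000):
    mr.insert((float(i), float(i % 7)))
pts, wts = mr.summary()
```

Because the levels behave like the bits of a binary counter, a stream of $n$ points occupies roughly $\log_2(n/m)$ levels of $m$ points each, which is where the logarithmic overhead of plain merge-and-reduce comes from. The summary above describes combining this ingredient with other techniques to obtain a size independent of $n$; the sketch does not attempt that.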
Abstract
We show that both clustering and subspace embeddings can be performed in the streaming model with the same asymptotic efficiency as in the central/offline setting. For $(k, z)$-clustering in the streaming model, we use a number of words of memory that is independent of the number $n$ of input points and the aspect ratio $\Delta$, yielding an optimal bound of $\tilde{\mathcal{O}}\left(\frac{dk}{\min(\varepsilon^4,\varepsilon^{z+2})}\right)$ words for accuracy parameter $\varepsilon$ on $d$-dimensional points. Additionally, we obtain an amortized update time of $d\,\log(k)\cdot\text{polylog}(\log(n\Delta))$, which is an exponential improvement over the previous $d\,\text{poly}(k,\log(n\Delta))$. Our method also gives the fastest runtime for $(k,z)$-clustering even in the offline setting. For subspace embeddings in the streaming model, we achieve $\mathcal{O}(d)$ update time and space-optimal constructions, using $\tilde{\mathcal{O}}\left(\frac{d^2}{\varepsilon^2}\right)$ words for $p\le 2$ and $\tilde{\mathcal{O}}\left(\frac{d^{p/2+1}}{\varepsilon^2}\right)$ words for $p>2$, showing that streaming algorithms can match offline algorithms in both space and time complexity.
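To get a feel for how the stated space bounds scale, the following Python sketch evaluates only their leading terms. The function names `clustering_words` and `embedding_words` are hypothetical, and the hidden constants and polylogarithmic factors in $\tilde{\mathcal{O}}(\cdot)$ are ignored, so the outputs are useful for comparing parameter settings, not as actual memory figures.

```python
def clustering_words(d, k, eps, z):
    """Leading term d*k / min(eps^4, eps^(z+2)) of the (k,z)-clustering bound.

    Constants and polylog factors are dropped (assumption for illustration).
    """
    return d * k / min(eps ** 4, eps ** (z + 2))


def embedding_words(d, eps, p):
    """Leading term of the L_p subspace embedding bound.

    d^2 / eps^2 for p <= 2, and d^(p/2 + 1) / eps^2 for p > 2.
    """
    exponent = 2 if p <= 2 else p / 2 + 1
    return d ** exponent / eps ** 2


# Example parameter settings (chosen arbitrarily for illustration).
kmedian = clustering_words(d=10, k=5, eps=0.5, z=1)
kmeans = clustering_words(d=10, k=5, eps=0.5, z=2)
cubic = clustering_words(d=10, k=5, eps=0.5, z=3)
l2 = embedding_words(d=10, eps=0.5, p=2)
l4 = embedding_words(d=10, eps=0.5, p=4)
```

For $0<\varepsilon<1$, the minimum in the clustering bound equals $\varepsilon^4$ when $z\le 2$ and $\varepsilon^{z+2}$ when $z>2$, so the leading terms for $k$-median ($z=1$) and $k$-means ($z=2$) coincide, while larger $z$ increases the dependence on $1/\varepsilon$.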
