Fast, Space-Optimal Streaming Algorithms for Clustering and Subspace Embeddings

Vincent Cohen-Addad; Liudeng Wang; David P. Woodruff; Samson Zhou

TL;DR

This work resolves a central question in data streams by showing that two core tasks, $(k,z)$-clustering and $L_p$ subspace embeddings, can be performed in one pass with space and time comparable to offline algorithms. The authors develop a unified framework that combines crude-and-refined sampling, merge-and-reduce, and an efficient global encoding to produce $(1+\varepsilon)$-strong coresets whose size is independent of the input size $n$, while achieving amortized update times of $\tilde{O}(d\log(k))$ for clustering and $O(d)$ for subspace embeddings. For clustering, they prove a close connection between online clustering sensitivity and $(k,z)$-medoids sensitivity, enabling efficient, sublinear-time streaming algorithms that match the offline coreset bounds up to polylogarithmic factors. For subspace embeddings, the methods yield space-optimal streaming constructions with tight dependence on dimension $d$ and exponent $p$, showing there is no inherent overhead in streaming relative to offline for these tasks. Together, these results bridge the gap between streaming and offline models, enabling scalable, real-time data analysis with strong theoretical guarantees and practical encoding schemes.
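The merge-and-reduce component named above can be illustrated with a minimal Python sketch. It shows only the generic binary-counter structure of merge-and-reduce; the `reduce_block` step here is a uniform weighted subsample chosen as a placeholder assumption, not the sensitivity-based sampling or the global encoding of the paper. The names `MergeAndReduce`, `reduce_block`, and `block_size` are hypothetical.

```python
import random


def reduce_block(points, size, rng):
    """Placeholder reduce step: uniform subsample, reweighted to keep total weight.

    Assumption: this stands in for a real coreset construction; it is NOT
    the sampling scheme used in the paper.
    """
    if len(points) <= size:
        return list(points)
    total = sum(w for _, w in points)
    sample = rng.sample(points, size)
    scale = total / sum(w for _, w in sample)
    return [(x, w * scale) for x, w in sample]


class MergeAndReduce:
    """Binary-counter merge-and-reduce over a stream of points."""

    def __init__(self, block_size, seed=0):
        self.block_size = block_size
        self.rng = random.Random(seed)
        self.buffer = []   # raw points not yet forming a full block
        self.levels = []   # levels[i] is None or a summary of 2**i blocks

    def insert(self, x):
        self.buffer.append((x, 1.0))
        if len(self.buffer) < self.block_size:
            return
        carry, self.buffer = self.buffer, []
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(carry)
                return
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # Two summaries at the same level: merge them, reduce, carry upward.
            merged = self.levels[i] + carry
            self.levels[i] = None
            carry = reduce_block(merged, self.block_size, self.rng)
            i += 1

    def summary(self):
        """Return all stored weighted points (buffer plus every occupied level)."""
        out = list(self.buffer)
        for level in self.levels:
            if level is not None:
                out.extend(level)
        return out
```

In this sketch the number of stored points grows with the number of levels, which is logarithmic in the stream length. The paper's stated contribution is a coreset size independent of $n$, so the sketch should be read as background on the classical framework rather than as the authors' construction.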

Abstract

We show that both clustering and subspace embeddings can be performed in the streaming model with the same asymptotic efficiency as in the central/offline setting. For $(k, z)$-clustering in the streaming model, we achieve a number of words of memory that is independent of the number $n$ of input points and the aspect ratio $\Delta$, yielding an optimal bound of $\tilde{\mathcal{O}}\left(\frac{dk}{\min(\varepsilon^4,\varepsilon^{z+2})}\right)$ words for accuracy parameter $\varepsilon$ on $d$-dimensional points. Additionally, we obtain amortized update time of $d\,\log(k)\cdot\text{polylog}(\log(n\Delta))$, which is an exponential improvement over the previous $d\,\text{poly}(k,\log(n\Delta))$. Our method also gives the fastest runtime for $(k,z)$-clustering even in the offline setting. For subspace embeddings in the streaming model, we achieve $\mathcal{O}(d)$ update time and space-optimal constructions, using $\tilde{\mathcal{O}}\left(\frac{d^2}{\varepsilon^2}\right)$ words for $p\le 2$ and $\tilde{\mathcal{O}}\left(\frac{d^{p/2+1}}{\varepsilon^2}\right)$ words for $p>2$, showing that streaming algorithms can match offline algorithms in both space and time complexity.
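To get a feel for how the stated space bounds scale, the following Python sketch evaluates the leading terms of the formulas above. It is an illustration only: the hidden constants and polylogarithmic factors in $\tilde{\mathcal{O}}(\cdot)$ are ignored, so the returned values are not actual memory requirements. The function names `clustering_words` and `subspace_words` are hypothetical.

```python
def clustering_words(d, k, eps, z):
    """Leading term d*k / min(eps^4, eps^(z+2)) of the (k, z)-clustering bound.

    Assumption: constants and polylog factors hidden by the tilde are dropped.
    """
    return d * k / min(eps ** 4, eps ** (z + 2))


def subspace_words(d, eps, p):
    """Leading term of the L_p subspace embedding bound.

    d^2 / eps^2 for p <= 2, and d^(p/2 + 1) / eps^2 for p > 2.
    Assumption: constants and polylog factors hidden by the tilde are dropped.
    """
    exponent = 2 if p <= 2 else p / 2 + 1
    return d ** exponent / eps ** 2


if __name__ == "__main__":
    # For eps < 1: when z <= 2 the eps^4 term dominates; when z > 2 the eps^(z+2) term does.
    for z in (1, 2, 3):
        print("z =", z, "clustering leading term:", clustering_words(10, 5, 0.5, z))
    for p in (1, 2, 4):
        print("p =", p, "subspace leading term:", subspace_words(10, 0.5, p))
```

Neither expression depends on $n$ or $\Delta$, which is the point the abstract emphasizes.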
