Fast, Space-Optimal Streaming Algorithms for Clustering and Subspace Embeddings

Vincent Cohen-Addad, Liudeng Wang, David P. Woodruff, Samson Zhou

TL;DR

This work resolves a central question in data streams by showing that two core tasks, $(k,z)$-clustering and $L_p$ subspace embeddings, can be performed in one pass with space and time comparable to offline algorithms. The authors develop a unified framework that combines crude-and-refined sampling, merge-and-reduce, and an efficient global encoding to produce $(1+\varepsilon)$-strong coresets whose size is independent of the input size $n$, while achieving amortized update times of $\tilde{O}(d\log(k))$ for clustering and $O(d)$ for subspace embeddings. For clustering, they prove a close connection between online clustering sensitivity and $(k,z)$-medoids sensitivity, which yields fast streaming algorithms that match the offline coreset bounds up to polylogarithmic factors. For subspace embeddings, the methods yield space-optimal streaming constructions with tight dependence on the dimension $d$ and the exponent $p$, showing that streaming carries no inherent overhead relative to the offline setting for these tasks. Together, these results close the gap between the streaming and offline models for both problems.
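The merge-and-reduce component named above can be illustrated with a minimal sketch. This is not the paper's algorithm: the reduce step here is a placeholder (uniform subsampling with weight rescaling) standing in for the crude-and-refined sensitivity sampling, and the names `MergeAndReduce`, `reduce_uniform`, and the block size `m` are hypothetical. The sketch only shows the binary-counter bookkeeping that keeps the number of stored summaries logarithmic in the stream length.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]  # (point, weight)


def reduce_uniform(items: List[Weighted], m: int, rng: random.Random) -> List[Weighted]:
    """Placeholder reduce step (assumption, not the paper's sampler):
    keep m items uniformly at random and rescale so total weight is unchanged."""
    if len(items) <= m:
        return list(items)
    total = sum(w for _, w in items)
    kept = rng.sample(items, m)
    scale = total / sum(w for _, w in kept)
    return [(p, w * scale) for p, w in kept]


class MergeAndReduce:
    """Binary-counter merge-and-reduce over an insertion-only stream."""

    def __init__(self, m: int, seed: int = 0) -> None:
        self.m = m                               # summary size per level
        self.rng = random.Random(seed)
        self.buffer: List[Weighted] = []         # raw points not yet in a block
        self.levels: List[List[Weighted]] = []   # levels[i] is empty or one summary

    def insert(self, p: Point) -> None:
        self.buffer.append((p, 1.0))
        if len(self.buffer) < self.m:
            return
        # The buffer is full: treat it as a level-0 summary and carry upward,
        # exactly like adding 1 to a binary counter.
        carry, self.buffer, i = self.buffer, [], 0
        while True:
            if i == len(self.levels):
                self.levels.append([])
            if not self.levels[i]:
                self.levels[i] = carry
                return
            # Merge two same-level summaries, then reduce back to m items.
            carry = reduce_uniform(self.levels[i] + carry, self.m, self.rng)
            self.levels[i] = []
            i += 1

    def summary(self) -> List[Weighted]:
        """Union of the buffer and all stored summaries."""
        out = list(self.buffer)
        for level in self.levels:
            out.extend(level)
        return out
```

After $n$ insertions the structure stores at most $m$ points per level across roughly $\log_2(n/m)$ levels, and the total weight of the summary equals $n$. The $\log n$ factor that this plain scheme introduces is one of the overheads the paper's framework is designed to avoid.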

Abstract

We show that both clustering and subspace embeddings can be performed in the streaming model with the same asymptotic efficiency as in the central/offline setting. For $(k, z)$-clustering in the streaming model, we achieve a number of words of memory which is independent of the number $n$ of input points and the aspect ratio $\Delta$, yielding an optimal bound of $\tilde{\mathcal{O}}\left(\frac{dk}{\min(\varepsilon^4,\varepsilon^{z+2})}\right)$ words for accuracy parameter $\varepsilon$ on $d$-dimensional points. Additionally, we obtain amortized update time of $d\,\log(k)\cdot\mathrm{polylog}(\log(n\Delta))$, which is an exponential improvement over the previous $d\,\mathrm{poly}(k,\log(n\Delta))$. Our method also gives the fastest runtime for $(k,z)$-clustering even in the offline setting. For subspace embeddings in the streaming model, we achieve $\mathcal{O}(d)$ update time and space-optimal constructions, using $\tilde{\mathcal{O}}\left(\frac{d^2}{\varepsilon^2}\right)$ words for $p\le 2$ and $\tilde{\mathcal{O}}\left(\frac{d^{p/2+1}}{\varepsilon^2}\right)$ words for $p>2$, showing that streaming algorithms can match offline algorithms in both space and time complexity.
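To make the stated space bounds concrete, the following sketch evaluates only their leading terms, dropping the polylogarithmic factors and constants hidden by $\tilde{\mathcal{O}}$. The function names are hypothetical and the outputs are growth rates, not actual memory figures. Note that for $\varepsilon\in(0,1)$ the minimum in the clustering bound is $\varepsilon^4$ when $z\le 2$ and $\varepsilon^{z+2}$ when $z>2$.

```python
def clustering_space_words(d: int, k: int, eps: float, z: float) -> float:
    """Leading term d*k / min(eps^4, eps^(z+2)); polylog factors and constants dropped."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return d * k / min(eps ** 4, eps ** (z + 2))


def embedding_space_words(d: int, eps: float, p: float) -> float:
    """Leading term d^2/eps^2 for p <= 2 and d^(p/2+1)/eps^2 for p > 2."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    exponent = 2.0 if p <= 2 else p / 2 + 1
    return d ** exponent / eps ** 2


if __name__ == "__main__":
    # k-median (z=1) and k-means (z=2) share the eps^4 regime; z=3 does not.
    for z in (1, 2, 3):
        print("z =", z, clustering_space_words(d=10, k=5, eps=0.5, z=z))
    # The two embedding formulas agree at p = 2 and diverge for larger p.
    for p in (1, 2, 4):
        print("p =", p, embedding_space_words(d=10, eps=0.5, p=p))
```

Neither function takes $n$ or $\Delta$ as an argument, which reflects the abstract's claim that the space bounds are independent of both.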

Paper Structure

This paper contains 35 sections, 49 theorems, 134 equations, 4 figures, 8 algorithms.

Key Result

Theorem 1.1

Given a set $X$ of $n$ points on $[\Delta]^d$ and an accuracy parameter $\varepsilon\in(0,1)$, there is a one-pass insertion-only streaming algorithm that uses $\tilde{\mathcal{O}}\left(\frac{dk}{\min(\varepsilon^4,\varepsilon^{2+z})}\right)$ words of space and $d\log(k)\cdot\mathrm{polylog}(\log(n\Delta))$ amortized update time …

Figures (4)

  • Figure 1: High-level summary of our approach
  • Figure 2: Table of $(k,z)$-clustering algorithms on data streams, omitting linear dependencies in the dimension $d$. We remark that [HenzingerK20, BhattacharyaCLP23, BhattacharyaCF24] can handle the fully-dynamic setting, whereas ours cannot. However, our algorithm uses sublinear space while theirs does not.
  • Figure 3: Table of $(k,z)$-clustering algorithms on insertion-only streams. We summarize existing results with $z=\mathcal{O}(1)$, $\Delta=\mathrm{poly}(n)$, and the assumption that $k>\frac{1}{\varepsilon^z}$ for the purpose of presentation.
  • Figure 4: Table of $L_p$ subspace embedding algorithms on insertion-only streams. We summarize existing results with $\kappa=\mathrm{poly}(n)$ for the purpose of presentation.

Theorems & Definitions (89)

  • Theorem 1.1: Fast and space-optimal clustering
  • Theorem 1.2
  • Theorem 1.3
  • Definition 1.5: Coreset
  • Theorem 1.6
  • Theorem 1.7: Johnson-Lindenstrauss lemma
  • Theorem 1.8: Hoeffding's inequality
  • Definition 2.1: Sensitivities for $(k,z)$-clustering
  • Definition 2.2: Online sensitivity for $(k,z)$-clustering
  • Theorem 2.3
  • ...and 79 more
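
For orientation, the notions named in Definitions 1.5, 2.1 and 2.2 are usually stated as follows. These are the standard textbook forms, written under the assumption of Euclidean distance and a center set $C$ of size $k$; the paper's exact statements may differ in normalization or constants. The $(k,z)$-clustering cost of a center set $C$ on a point set $X$ is

$$\mathrm{cost}(X,C)=\sum_{x\in X}\min_{c\in C}\|x-c\|_2^z .$$

A weighted set $(S,w)$ is a $(1+\varepsilon)$-strong coreset for $X$ if, for every $C$ with $|C|=k$,

$$(1-\varepsilon)\,\mathrm{cost}(X,C)\;\le\;\sum_{s\in S}w(s)\min_{c\in C}\|s-c\|_2^z\;\le\;(1+\varepsilon)\,\mathrm{cost}(X,C).$$

The sensitivity of a point $x\in X$ is its largest possible share of the cost,

$$s(x)=\max_{C:\,|C|=k}\frac{\min_{c\in C}\|x-c\|_2^z}{\mathrm{cost}(X,C)},$$

and the online sensitivity of the $t$-th stream element $x_t$ is the same quantity with $X$ replaced by the prefix $X_t=\{x_1,\dots,x_t\}$, so it can be evaluated when $x_t$ arrives without knowledge of later points.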