Space Complexity of Euclidean Clustering

Xiaoyi Zhu, Yuxiang Tian, Lingxiao Huang, Zengfeng Huang

TL;DR

This work initiates the study of space complexity for Euclidean $(k,z)$-Clustering, establishing upper and lower bounds that tie the achievable compression to coreset size and dimension. When $k$ is constant, the bounds are nearly tight, so storing a coreset is essentially an optimal compression scheme. The lower bound also yields a tight space bound of $\Theta(nd)$ for terminal embedding under common dimension regimes, complementing existing dimension-reduction techniques. The authors develop a lower-bound framework based on principal angles and discrepancy (partial coloring) to separate costs across dataset families, and pair it with an upper-bound framework using coresets plus quantization. They further connect these space bounds to distributed and streaming settings, providing explicit bit- and communication-cost tradeoffs. Overall, the paper links space complexity to coresets through geometric insights about principal angles, offering tools for both theory and scalable clustering practice.

Abstract

The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of data involved, compression methods for the Euclidean $(k, z)$-Clustering problem, such as data compression and dimension reduction, have received significant attention in the literature. However, the space complexity of the clustering problem, specifically, the number of bits required to compress the cost function within a multiplicative error $\varepsilon$, remains unclear in existing literature. This paper initiates the study of space complexity for Euclidean $(k, z)$-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, serves as the optimal compression scheme. Furthermore, our lower bound result for $(k, z)$-Clustering establishes a tight space bound of $\Theta(nd)$ for terminal embedding, where $n$ represents the dataset size. Our technical approach leverages new geometric insights for principal angles and discrepancy methods, which may hold independent interest.
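To make the object being compressed concrete, the following is a minimal sketch of the cost function the abstract refers to, assuming the usual definition of $(k, z)$-Clustering cost: the sum over data points of the distance to the nearest center, raised to the power $z$. The function names, the toy data, and the weighted-summary check are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def clustering_cost(points, centers, z=2.0, weights=None):
    """Sum over points of (distance to nearest center) ** z, optionally weighted."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise Euclidean distances, shape (num_points, num_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1) ** z
    if weights is None:
        return float(nearest.sum())
    return float(np.dot(np.asarray(weights, dtype=float), nearest))

def within_relative_error(estimate, true_cost, eps):
    """Check |estimate - true_cost| <= eps * true_cost (multiplicative error eps)."""
    return abs(estimate - true_cost) <= eps * true_cost

# Toy dataset in which every point appears twice (hypothetical data).
P = [[0, 0], [0, 0], [2, 0], [2, 0], [10, 0], [10, 0]]
# A weighted summary: the distinct points, each with weight 2.
S = [[0, 0], [2, 0], [10, 0]]
w = [2, 2, 2]
C = [[1, 0], [10, 0]]  # one candidate set of k = 2 centers

full_cost = clustering_cost(P, C, z=2.0)
summary_cost = clustering_cost(S, C, z=2.0, weights=w)
```

Here the summary reproduces the cost exactly for this center set because it only merges duplicate points. An $\varepsilon$-coreset relaxes this to a multiplicative error $\varepsilon$ that must hold for every choice of $k$ centers, and the space-complexity question asks how many bits any such summary needs.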

Paper Structure

This paper contains 18 sections, 27 theorems, 136 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 1.2

Suppose for any dataset $\boldsymbol{P}\subseteq [\Delta]^d$ of size $n$, there exists an $\varepsilon$-coreset of $\boldsymbol{P}$ for $(k, z)$-Clustering of size at most $\Psi(n) \geq 1$. We have the following space upper bounds:

Figures (1)

  • Figure 1: Example of principal angles of two distinct planes in $\mathbb{R}^3$ sharing a line.
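The situation in Figure 1 can be reproduced numerically. The sketch below assumes the usual definition of principal angles and uses the standard QR-plus-SVD recipe: orthonormalize a basis of each subspace, then take the arccosine of the singular values of the product of the two bases. The function name and the example planes are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spans of A and B.

    Assumes both A and B have full column rank.
    """
    Qa, _ = np.linalg.qr(np.asarray(A, dtype=float))  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(np.asarray(B, dtype=float))  # orthonormal basis of span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against round-off pushing a cosine slightly above 1.
    return np.arccos(np.clip(cosines, -1.0, 1.0))

theta = np.pi / 3
# Plane 1: span{e1, e2}. Plane 2: span{e1, cos(theta) e2 + sin(theta) e3}.
# Both planes contain the line spanned by e1.
plane1 = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])
plane2 = np.array([[1.0, 0.0],
                   [0.0, np.cos(theta)],
                   [0.0, np.sin(theta)]])

angles = principal_angles(plane1, plane2)
```

Because the two planes share a line, the smallest principal angle is zero up to round-off, and the second equals the dihedral angle `theta`. The arccosine form loses accuracy for very small angles, so a sine-based variant is preferable when near-zero angles matter.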

Theorems & Definitions (59)

  • Definition 1.1: Space complexity for Euclidean $(k, z)$-Clustering
  • Theorem 1.2: Space upper bounds
  • Theorem 1.3: Space lower bounds
  • Definition 1.4: Terminal embedding
  • Theorem 1.5: Informal; see \ref{thm:embedding}
  • Lemma 2.1: Relaxed triangle inequality (Lemma 10 of cohen2022towards)
  • Definition 2.2: Partial coloring
  • Definition 2.3: Principal angles
  • Lemma 2.4: Property of principal angles (Theorem 1 in bjorck1973numerical)
  • Lemma 3.1: Sum of weights
  • ...and 49 more