Space Complexity of Euclidean Clustering

Xiaoyi Zhu; Yuxiang Tian; Lingxiao Huang; Zengfeng Huang

TL;DR

This work initiates the study of space complexity for Euclidean $(k,z)$-Clustering, establishing upper and lower bounds on the number of bits needed to preserve the clustering cost within a multiplicative error $\varepsilon$. When $k$ is constant the bounds are nearly tight, so storing a coreset is essentially an optimal compression scheme. The lower bound also yields a tight space bound of $\Theta(nd)$ for terminal embedding under common dimension regimes, which complements existing dimension-reduction techniques. The lower-bound framework uses principal angles and discrepancy (partial coloring) to separate costs across dataset families, and the upper-bound framework combines coresets with quantization. The authors further connect these space bounds to distributed and streaming settings, giving explicit bit and communication-cost tradeoffs.

Abstract

The $(k, z)$-Clustering problem in Euclidean space $\mathbb{R}^d$ has been extensively studied. Given the scale of data involved, compression methods for the Euclidean $(k, z)$-Clustering problem, such as data compression and dimension reduction, have received significant attention in the literature. However, the space complexity of the clustering problem, specifically, the number of bits required to compress the cost function within a multiplicative error $\varepsilon$, remains unclear in existing literature. This paper initiates the study of space complexity for Euclidean $(k, z)$-Clustering and offers both upper and lower bounds. Our space bounds are nearly tight when $k$ is constant, indicating that storing a coreset, a well-known data compression approach, serves as the optimal compression scheme. Furthermore, our lower bound result for $(k, z)$-Clustering establishes a tight space bound of $\Theta(nd)$ for terminal embedding, where $n$ represents the dataset size. Our technical approach leverages new geometric insights for principal angles and discrepancy methods, which may hold independent interest.
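To make "compressing the cost function" concrete, the sketch below evaluates the standard $(k, z)$-Clustering cost $\mathrm{cost}_z(P, C) = \sum_{p \in P} \min_{c \in C} \|p - c\|^z$ and compares a raw dataset with a weighted summary of it. It is only an illustration under stated assumptions: the summary simply collapses duplicate points (which preserves the cost exactly, whereas a real $\varepsilon$-coreset is smaller and only approximate), and the bit counts use a naive fixed-width encoding of points in $[\Delta]^d$, not the paper's bounds. The function names `clustering_cost`, `collapse_duplicates`, `naive_bits`, and `weighted_bits` are hypothetical.

```python
import math
import numpy as np


def clustering_cost(points, centers, z, weights=None):
    """(k, z)-Clustering cost: sum over points of (distance to nearest center)^z."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise Euclidean distances, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1) ** z
    if weights is None:
        return float(nearest.sum())
    return float(np.dot(np.asarray(weights, dtype=float), nearest))


def collapse_duplicates(points):
    """Toy exact summary (assumption, not a real coreset construction):
    keep each distinct point once, weighted by its multiplicity."""
    uniq, counts = np.unique(np.asarray(points), axis=0, return_counts=True)
    return uniq, counts


def naive_bits(num_points, d, delta):
    """Fixed-width encoding of points in [delta]^d (illustrative count only)."""
    return num_points * d * math.ceil(math.log2(delta))


def weighted_bits(num_points, d, delta, max_weight):
    """Same encoding plus a fixed-width integer weight per stored point."""
    weight_bits = math.ceil(math.log2(max_weight + 1))
    return num_points * (d * math.ceil(math.log2(delta)) + weight_bits)


# Small example in [8]^2 with k = 2 centers and z = 2.
P = np.array([[1, 1], [1, 1], [1, 1], [4, 5], [4, 5], [8, 2]])
C = np.array([[1, 1], [5, 5]])
S, w = collapse_duplicates(P)

full_cost = clustering_cost(P, C, z=2)
summary_cost = clustering_cost(S, C, z=2, weights=w)
```

In this example the weighted summary stores three points instead of six while reproducing the cost for every choice of centers. An $\varepsilon$-coreset relaxes that exactness to a $(1 \pm \varepsilon)$ factor in exchange for a much smaller summary, and quantizing the stored coordinates and weights is what turns a point count into a bit count.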

Paper Structure (18 sections, 27 theorems, 136 equations, 1 figure, 2 tables, 1 algorithm)

Introduction
Problem Definition and Our Results
Technical Overview
Other Related Work
Preliminaries
Proof of Theorem 1.2: Space Upper Bounds
Proof of Theorem 1.3: Space Lower Bounds
Proof of Theorem 1.3
Proof of Lemma \ref{lmm:large_cost}: Principal Angles to Cost Difference
Proof of Lemma \ref{lmm:large_size}: Construction of a Large Family $\mathcal{P}$
Extension to General $z\geq 1$
Extension to General $k\geq 2$
Application to Space Lower Bound for Terminal Embedding
Application of Coreset Construction in Distributed and Streaming Settings
Communication Cost for Distributed $(k, z)$-Clustering
...and 3 more sections

Key Result

Theorem 1.2

Suppose for any dataset $\boldsymbol{P}\subseteq [\Delta]^d$ of size $n$, there exists an $\varepsilon$-coreset of $\boldsymbol{P}$ for $(k, z)$-Clustering of size at most $\Psi(n) \geq 1$. We have the following space upper bounds:

Figures (1)

Figure 1: Example of principal angles of two distinct planes in $\mathbb{R}^3$ sharing a line.
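The situation in Figure 1 can be reproduced numerically. The sketch below uses the standard SVD characterization of principal angles due to Björck and Golub (1973): after orthonormalizing a basis of each subspace, the cosines of the principal angles are the singular values of $Q_A^\top Q_B$. Whether this matches the exact statement of the paper's Lemma 2.4 is an assumption; the function name `principal_angles` and the example planes are hypothetical.

```python
import numpy as np


def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B.

    Assumes both matrices have full column rank.
    """
    # Orthonormal bases for the two subspaces (reduced QR).
    Qa, _ = np.linalg.qr(np.asarray(A, dtype=float))
    Qb, _ = np.linalg.qr(np.asarray(B, dtype=float))
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(sigma, -1.0, 1.0))


# Two distinct planes in R^3 that share the x-axis; the second is the
# xy-plane rotated about the x-axis by t radians.
t = 0.3
plane_xy = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])
plane_tilted = np.array([[1.0, 0.0],
                         [0.0, np.cos(t)],
                         [0.0, np.sin(t)]])

angles = principal_angles(plane_xy, plane_tilted)
```

Because the two planes share a line, the smallest principal angle is zero up to rounding, and the second equals the tilt `t` between them. Taking `arccos` of a singular value close to 1 loses precision, so the zero angle is only recovered to roughly $10^{-8}$.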

Theorems & Definitions (59)

Definition 1.1: Space complexity for Euclidean $(k, z)$-Clustering
Theorem 1.2: Space upper bounds
Theorem 1.3: Space lower bounds
Definition 1.4: Terminal embedding
Theorem 1.5: Informal; see \ref{thm:embedding}
Lemma 2.1: Relaxed triangle inequality (Lemma 10 of cohen2022towards)
Definition 2.2: Partial Coloring
Definition 2.3: Principal angles
Lemma 2.4: Property of principal angles (Theorem 1 in bjorck1973numerical)
Lemma 3.1: Sum of weights
...and 49 more
