Distributed Algorithms for Euclidean Clustering

Vincent Cohen-Addad, Liudeng Wang, David P. Woodruff, Samson Zhou

TL;DR

The techniques combine new strategies for constant-factor approximation with efficient coreset constructions and compact encoding schemes, leading to optimal protocols that match both the communication costs of the best-known offline coreset constructions and existing lower bounds up to polylogarithmic factors.

Abstract

We study the problem of constructing $(1+\varepsilon)$-coresets for Euclidean $(k,z)$-clustering in the distributed setting, where $n$ data points are partitioned across $s$ sites. We focus on two prominent communication models: the coordinator model and the blackboard model. In the coordinator model, we design a protocol that achieves a $(1+\varepsilon)$-strong coreset with total communication complexity $\tilde{O}\left(sk + \frac{dk}{\min(\varepsilon^4,\varepsilon^{2+z})} + dk\log(n\Delta)\right)$ bits, improving upon prior work (Chen et al., NeurIPS 2016) by eliminating the need to communicate explicit point coordinates in the clear across all servers. In the blackboard model, we further reduce the communication complexity to $\tilde{O}\left(s\log(n\Delta) + dk\log(n\Delta) + \frac{dk}{\min(\varepsilon^4,\varepsilon^{2+z})}\right)$ bits, achieving better bounds than previous approaches while upgrading from constant-factor to $(1+\varepsilon)$-approximation guarantees. Our techniques combine new strategies for constant-factor approximation with efficient coreset constructions and compact encoding schemes, leading to optimal protocols that match both the communication costs of the best-known offline coreset constructions and existing lower bounds (Chen et al., NeurIPS 2016; Huang et al., STOC 2024), up to polylogarithmic factors.
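To get a feel for how the two bounds in the abstract compare, the following is a minimal sketch that evaluates only their leading terms. It is not part of the paper: the function names are hypothetical, the constants and polylogarithmic factors hidden by $\tilde{O}$ are ignored, and the logarithm is assumed to be base 2.

```python
import math


def precision_term(d, k, eps, z):
    # Shared term dk / min(eps^4, eps^(2+z)) from both bounds.
    return d * k / min(eps ** 4, eps ** (2 + z))


def coordinator_bits(s, k, d, eps, z, n, delta):
    # Leading terms of the coordinator-model bound:
    # sk + dk / min(eps^4, eps^(2+z)) + dk log(n * Delta).
    # Assumption: log base 2, hidden constants and polylog factors dropped.
    log_term = math.log2(n * delta)
    return s * k + precision_term(d, k, eps, z) + d * k * log_term


def blackboard_bits(s, k, d, eps, z, n, delta):
    # Leading terms of the blackboard-model bound:
    # s log(n * Delta) + dk log(n * Delta) + dk / min(eps^4, eps^(2+z)).
    log_term = math.log2(n * delta)
    return s * log_term + d * k * log_term + precision_term(d, k, eps, z)


if __name__ == "__main__":
    params = dict(s=1000, k=50, d=10, eps=0.5, z=2, n=2 ** 10, delta=2 ** 10)
    print("coordinator:", coordinator_bits(**params))
    print("blackboard: ", blackboard_bits(**params))
```

The two expressions differ only in the per-site term, $sk$ versus $s\log(n\Delta)$, so in this simplified comparison the blackboard expression is the smaller one exactly when $k$ exceeds $\log(n\Delta)$.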

Paper Structure

This paper contains 42 sections, 47 theorems, 67 equations, 8 figures, 16 algorithms.

Key Result

Theorem 1.1

Given accuracy parameter $\varepsilon\in(0,1)$, there exists a protocol on $n$ points distributed across $s$ sites that produces a $(1+\varepsilon)$-strong coreset for $(k,z)$-clustering that uses $\tilde{\mathcal{O}}\left(sk + \frac{dk}{\min(\varepsilon^4, \varepsilon^{2+z})} + dk \log (n \Delta)\right)$ bits ...
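As an illustration of what a $(1+\varepsilon)$-strong coreset is meant to guarantee, here is a minimal sketch under the standard definition: the weighted $(k,z)$-clustering cost of the coreset stays within a $(1\pm\varepsilon)$ factor of the cost of the full data set. The helper names are hypothetical and the paper's own Definition 2.3 may be phrased differently. A strong coreset must satisfy the bound for every set of $k$ centers, whereas this check only tests one given set.

```python
import math


def kz_cost(points, centers, z, weights=None):
    # (k, z)-clustering cost: weighted sum over points of the z-th power of
    # the Euclidean distance to the nearest center (z=1: k-median, z=2: k-means).
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in zip(points, weights)
    )


def within_coreset_bounds(points, coreset, weights, centers, z, eps):
    # Checks the (1 +/- eps) cost-preservation condition for ONE set of
    # centers only; a strong coreset requires it for all sets of k centers.
    full = kz_cost(points, centers, z)
    approx = kz_cost(coreset, centers, z, weights)
    return (1 - eps) * full <= approx <= (1 + eps) * full


if __name__ == "__main__":
    # Toy example: merging duplicate points into one weighted point
    # preserves the cost exactly for any choice of centers.
    data = [(0.0, 0.0), (0.0, 0.0), (2.0, 0.0)]
    summary = [(0.0, 0.0), (2.0, 0.0)]
    summary_weights = [2.0, 1.0]
    print(within_coreset_bounds(data, summary, summary_weights, [(1.0, 0.0)], z=2, eps=0.1))
```

The toy summary simply merges duplicates, so it carries no compression benefit; it is only there to show the shape of the guarantee.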

Figures (8)

  • Figure 1: Table of $(k,z)$-clustering algorithms in the distributed setting. We remark that Chen et al. (NeurIPS 2016) only achieves a constant-factor approximation, whereas we achieve a $(1+\varepsilon)$-approximation.
  • Figure 2: Informal version of bicriteria approximation through adaptive sampling.
  • Figure 3: Informal version of efficient communication in the message-passing algorithm. For the full algorithm, see \ref{alg:alg:efficient:communication}.
  • Figure 4: Informal version of the message-passing algorithm. For the full algorithm, see \ref{alg:alg:coreset:coordinator}.
  • Figure 5: Experiments for clustering costs and communication costs on the DIGITS dataset.
  • ...and 3 more figures

Theorems & Definitions (74)

  • Theorem 1.1: Communication-optimal clustering in the coordinator model, informal
  • Theorem 1.2: Communication-optimal clustering in the blackboard model, informal
  • Definition 2.3: Coreset
  • Theorem 2.4
  • Theorem 2.5: Johnson-Lindenstrauss lemma
  • Theorem 2.6: Hoeffding's inequality
  • Theorem 2.7
  • Theorem 2.8
  • Theorem 2.9
  • Theorem 2.10
  • ...and 64 more