Fully Scalable MPC Algorithms for Clustering in High Dimension

Artur Czumaj; Guichen Gao; Shaofeng H. -C. Jiang; Robert Krauthgamer; Pavel Veselý

TL;DR

This work addresses high-dimensional clustering within the fully-scalable MPC model, where local memory per machine scales as a small polynomial in the input size. It introduces a robust geometric-aggregation primitive based on consistent hashing, enabling O(1)-round solutions for Power-$z$ Facility Location and, via a facility-location-based reduction, for $(k,z)$-Clustering with controllable bicriteria guarantees. The main technical contributions include a parallel Mettu-Plaxton-style approach for facility opening, precise approximation analyses (including $O(1)$-approximation for facility location and $(O_{\varepsilon}(\mu^{-2}),1+\mu)$-bicriteria for clustering), and weak-coreset techniques enabling near-$k$ solutions with limited communication. The results significantly advance the state of fully-scalable MPC clustering by achieving constant rounds in high dimensions and providing practical constructions for distributed clustering on massive datasets. The methods have potential impact for large-scale data analysis in MapReduce-like systems, enabling provably efficient high-dimensional clustering with strong approximation guarantees.
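For orientation, the classical sequential Mettu-Plaxton rule that the "Mettu-Plaxton-style approach" alludes to can be sketched as follows. This is a minimal sketch of the textbook sequential algorithm for uniform facility location with $z=1$, not the paper's parallel procedure. Each point $p$ gets a radius $r_p$ solving $\sum_{q:\,\mathrm{dist}(p,q)\le r_p}(r_p-\mathrm{dist}(p,q))=f$, and points are then opened greedily in order of increasing radius. The function names (`mp_radius`, `mettu_plaxton`, `ufl_cost`) and the brute-force distance computations are illustrative assumptions.

```python
import math

def mp_radius(points, i, f):
    """Solve sum over q with dist(p, q) <= r of (r - dist(p, q)) = f for r, where p = points[i]."""
    dists = sorted(math.dist(points[i], q) for q in points)
    prefix = 0.0  # sum of the m smallest distances from p
    for m, d in enumerate(dists, start=1):
        prefix += d
        r = (f + prefix) / m  # candidate radius if exactly the m nearest points are in the ball
        nxt = dists[m] if m < len(dists) else math.inf
        if r <= nxt:  # candidate is consistent: the (m+1)-th point lies outside the ball
            return r
    raise ValueError("points must be non-empty and f must be positive")

def mettu_plaxton(points, f):
    """Sequential greedy: scan points by increasing radius, open p if no open facility is within 2 * r_p."""
    radii = [mp_radius(points, i, f) for i in range(len(points))]
    opened = []
    for i in sorted(range(len(points)), key=radii.__getitem__):
        if all(math.dist(points[i], points[j]) > 2 * radii[i] for j in opened):
            opened.append(i)
    return opened, radii

def ufl_cost(points, opened, f, z=1):
    """Power-z uniform facility location cost: f per open facility plus sum of dist^z to the nearest one."""
    connection = sum(min(math.dist(p, points[j]) for j in opened) ** z for p in points)
    return f * len(opened) + connection

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (100.0, 0.0), (100.0, 1.0)]
    opened, radii = mettu_plaxton(pts, f=10.0)
    print(opened, radii, ufl_cost(pts, opened, f=10.0))
```

The sequential scan and the all-pairs distances are exactly what a fully-scalable MPC algorithm cannot afford, which is why the radius estimation and the conflict check need to be expressed through neighborhood statistics that can be aggregated in parallel.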

Abstract

We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine may be $n^σ$ for arbitrarily small fixed $σ>0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$, yet all our algorithms are fast, i.e., run in $O(1)$ rounds. We first devise a fast MPC algorithm for $O(1)$-approximation of uniform facility location. This is the first fully-scalable MPC algorithm that achieves $O(1)$-approximation for any clustering problem in general geometric setting; previous algorithms only provide $\mathrm{poly}(\log n)$-approximation or apply to restricted inputs, like low dimension or small number of clusters $k$; e.g. [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, it computes $(1+\varepsilon)k$ clusters of cost within $O(1/\varepsilon^2)$-factor of the optimum for $k$ clusters. A primary technical tool that we introduce, and may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
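To make the geometric aggregation primitive concrete, here is a minimal single-machine sketch of approximate range counting by bucketing. It swaps in a randomly shifted grid for the consistent hashing used in the paper, so it only illustrates the hash-then-aggregate pattern and carries none of the high-dimensional guarantees. The names `grid_cell`, `approx_neighborhood_counts`, and `exact_ball_count`, and the parameter `cell`, are hypothetical.

```python
import math
import random
from collections import Counter

def grid_cell(point, cell, shift):
    """Bucket id of a point in a grid of side length `cell`, shifted by `shift`."""
    return tuple(math.floor((x + s) / cell) for x, s in zip(point, shift))

def approx_neighborhood_counts(points, cell, seed=0):
    """For every point, return the number of input points sharing its grid bucket.

    Points in one bucket are at distance less than cell * sqrt(d), so each
    returned count is a lower bound on the exact count in a ball of that radius.
    """
    rng = random.Random(seed)
    d = len(points[0])
    shift = [rng.uniform(0, cell) for _ in range(d)]
    keys = [grid_cell(p, cell, shift) for p in points]  # "map": hash every point
    counts = Counter(keys)                               # "reduce": aggregate per bucket
    return [counts[k] for k in keys]                     # send each statistic back to its point

def exact_ball_count(points, p, r):
    """Brute-force reference: number of points within distance r of p."""
    return sum(1 for q in points if math.dist(p, q) <= r)

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    print(approx_neighborhood_counts(pts, cell=1.0))
```

In an MPC setting the per-bucket aggregation would typically be realized by sorting points by bucket id and combining within each run. A plain grid degrades with the dimension $d$ (bucket diameter grows as $\sqrt{d}$ times the side length, and a ball can intersect exponentially many cells), which is the gap that consistent hashing is meant to close.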

Paper Structure (45 sections, 19 theorems, 46 equations, 7 algorithms)

Introduction
Our Results
$k$-Clustering.
Technical Overview
Facility Location.
MPC Primitive for Geometric Aggregation in High Dimension.
Computing A Solution for Facility Location.
$k$-Clustering.
Related Work
Parallel and Distributed Algorithms.
Connections to Streaming.
MPC Algorithms for MST.
Preliminaries
Power-$z$ (Uniform) Facility Location.
$(k,z)$-Clustering.
...and 30 more sections

Key Result

Theorem 1.1

Let $\varepsilon,\sigma \in (0,1)$ be fixed. There is a randomized fully-scalable MPC algorithm that, given a multiset $P\subset{\mathbb{R}}^d$ of $n$ points distributed across machines with local memory of size $s \ge n^{\sigma}\cdot\mathop{\mathrm{poly}}\nolimits(d)$, computes in $O_\sigma(1)$ rou

Theorems & Definitions (41)

Theorem 1.1: Simplified version; see \ref{thm:ufl}
Remark 1.2
Theorem 1.3: Simplified version; see \ref{thm:clustering}
Definition 2.2: [arXiv:2204.02095]
Lemma 2.3: [arXiv:2204.02095]
Remark 2.4
Theorem 3.1: Geometric Aggregation in MPC
proof
Lemma 3.2: Sorting in MPC [Goodrich et al., ISAAC'11]
Theorem 4.1
...and 31 more
