A Faster $k$-means++ Algorithm

Jiehao Liang; Somdeb Sarkhel; Zhao Song; Chenbo Yin; Junze Yin; Danyang Zhuo

TL;DR

The paper addresses speeding up the initialization phase of $k$-means clustering by introducing FastKMeans++, which leverages a DistanceOracle built on JL-based sketches to approximate distances. This approach decouples the high-dimensional distance calculations from the core iteration, achieving a nearly optimal total runtime of $\widetilde{O}(nd + nk^2)$ while preserving a constant-factor approximation to the optimal centers. The authors prove a formal bound: $\mathbb{E}[\mathrm{cost}(P, C)] = O(\mathrm{cost}(P, C^*))$ and provide a running time of $O(\varepsilon^{-2} n (d + k^2 \log \log k) \log(n/\delta))$ with space $O(n(d+k+\varepsilon^{-2} \log(n/\delta)))$, along with $O(\varepsilon^{-2} n k \log(n/\delta))$ time for the LocalSearch++ component. Empirical results on synthetic and real datasets demonstrate practical speedups, especially in high dimensions, validating the method's scalability. Overall, the work advances scalable clustering initialization by blending distance-sketching and data-structure tricks to achieve both theoretical and practical efficiency gains.
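The idea of approximating distances through a JL-based sketch can be illustrated with a minimal Python sketch. It assumes a dense Gaussian projection scaled by $1/\sqrt{m}$, which is one standard Johnson-Lindenstrauss construction; the paper's DistanceOracle may be built differently. The names `make_sketch`, `apply_sketch`, and `sq_dist` and the values of `d` and `m` are hypothetical, chosen only for illustration.

```python
import math
import random


def make_sketch(d, m, seed=0):
    """Return an m x d Gaussian matrix scaled by 1/sqrt(m).

    Assumption: a dense Gaussian JL matrix; not necessarily the paper's choice.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(m)]


def apply_sketch(S, x):
    """Project a d-dimensional point x down to m dimensions."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


# Illustrative sizes (hypothetical): original dimension d, sketch dimension m.
d, m = 1000, 200
rng = random.Random(1)
p = [rng.gauss(0.0, 1.0) for _ in range(d)]
q = [rng.gauss(0.0, 1.0) for _ in range(d)]

S = make_sketch(d, m)
exact = sq_dist(p, q)                                    # costs O(d)
approx = sq_dist(apply_sketch(S, p), apply_sketch(S, q)) # costs O(m) once sketched
ratio = approx / exact
print(f"exact={exact:.2f} approx={approx:.2f} ratio={ratio:.3f}")
```

Each point is sketched once up front, after which every distance query touches only $m$ coordinates instead of $d$. That is the sense in which the high-dimensional work is separated from the iterative part.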

Abstract

$k$-means++ is an important algorithm for choosing initial cluster centers for the $k$-means clustering algorithm. In this work, we present a new algorithm that can solve the $k$-means++ problem with nearly optimal running time. Given $n$ data points in $\mathbb{R}^d$, the current state-of-the-art algorithm runs in $\widetilde{O}(k)$ iterations, and each iteration takes $\widetilde{O}(ndk)$ time. The overall running time is thus $\widetilde{O}(ndk^2)$. We propose a new algorithm, \textsc{FastKmeans++}, that takes only $\widetilde{O}(nd + nk^2)$ time in total.
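For background, the classical $k$-means++ seeding rule ($D^2$ sampling) can be sketched in a few lines of Python. This is the textbook procedure with exact distances, not the paper's \textsc{FastKmeans++} and not the local-search algorithm it compares against. The function name `kmeanspp_seed` and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_seed(points, k, seed=0):
    """Classical D^2 sampling: return (centers, cost of the seeding).

    Each new center is drawn with probability proportional to a point's
    squared distance to its nearest already-chosen center.
    """
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0.0:
            # Every point coincides with a center; fall back to uniform choice.
            c = points[rng.randrange(len(points))]
        else:
            c = rng.choices(points, weights=d2, k=1)[0]
        centers.append(c)
        # One pass over all n points, each costing O(d) with exact distances.
        d2 = [min(old, sq_dist(p, c)) for old, p in zip(d2, points)]
    return centers, sum(d2)


# Toy data (hypothetical): three well-separated groups in the plane.
data_rng = random.Random(42)
points = [
    (cx + data_rng.gauss(0.0, 0.1), cy + data_rng.gauss(0.0, 0.1))
    for cx, cy in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    for _ in range(20)
]
centers, cost = kmeanspp_seed(points, k=3)
print(len(centers), cost)
```

Every distance evaluation here costs $O(d)$ time, which is the per-distance cost that a sketch-based oracle aims to reduce.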

Paper Structure (30 sections, 21 theorems, 39 equations, 11 figures, 1 table, 4 algorithms)

This paper contains 30 sections, 21 theorems, 39 equations, 11 figures, 1 table, 4 algorithms.

Introduction
Our Results
Related Work
Clustering
Sketching for Iterative Algorithm
Roadmap
Preliminary
Notation
Related Definitions
Useful Lemmas
Data Structure
Distance Oracle
Running Time of Distance Oracle
Main Result
Running Time and Space
...and 15 more sections

Key Result

Theorem 1.1

Given a point set $P \subset \mathbb{R}^d$ of $n$ points and $Z = \widetilde{O}(k)$, Algorithm \ref{alg:k_means} runs in $\widetilde{O}(n(d + k^2))$ time and uses $O(n(d + k))$ space. Let $C$ denote the output of Algorithm \ref{alg:k_means} and let $C^*$ be the set of optimal centers. Then we have $\mathbb{E}[\mathrm{cost}(P, C)] = O(\mathrm{cost}(P, C^*))$.
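To get a feel for the gap between this bound and the $\widetilde{O}(ndk^2)$ bound quoted in the abstract, here is a rough worked comparison. The values $n = 10^6$, $d = 10^3$, $k = 10^2$ are purely illustrative assumptions, and all constants and polylogarithmic factors hidden by $\widetilde{O}(\cdot)$ are ignored, so the ratio is only indicative.

$$
\begin{aligned}
ndk^2 &= 10^6 \cdot 10^3 \cdot 10^4 = 10^{13}, \\
n(d + k^2) &= 10^6 \cdot (10^3 + 10^4) = 1.1 \times 10^{10}, \\
\frac{ndk^2}{n(d + k^2)} &= \frac{dk^2}{d + k^2} \approx 909 .
\end{aligned}
$$

The ratio $dk^2/(d + k^2)$ grows with both $d$ and $k$, which is consistent with the summary's remark that the speedups are most visible in high dimensions.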

Figures (11)

Figure 1: The relationship between each parameter and the running time, where "original algorithm" denotes the algorithm in ls19 and "our algorithm" denotes FastKMeans++ in Theorem \ref{thm:k_means_formal}. Let $n$ be the number of points in the point set. Let $d$ denote the dimension of each node. Let $m$ denote the dimension of each node after we process them with a sketching matrix. Let $k$ be the number of clusters and centers.
Figure 2: The running time of $k$-means++ and FastKMeans++ algorithm on SCADI and MUPCI data set. $m$ denotes the dimension of each node after we transformed all of them, and $k$ denotes the number of clusters. There are two figures for each data set. The left one shows the relationship between the running time of two algorithms and $m$, and the right one shows the relationship between the running time of two algorithms and $k$.
Figure 3: The running time of $k$-means++ and FastKMeans++ algorithm on Libras Movement and STDW data set. $m$ denotes the dimension of each node after we transformed all of them, and $k$ denotes the number of clusters. There are two figures for each data set. The left one shows the relationship between the running time of two algorithms and $m$, and the right one shows the relationship between the running time of two algorithms and $k$.
Figure 4: The relationship between running time of original $k$-means++ algorithm and parameter $n$
Figure 5: The relationship between running time of our FastKMeans++ algorithm and parameter $n$
...and 6 more figures

Theorems & Definitions (44)

Theorem 1.1: Informal Version of Theorem \ref{thm:k_means_formal}
Definition 3.1: Cost
Definition 3.2: Mean
Lemma 3.3: JL Lemma, jl84
Lemma 3.4
Lemma 3.5
Theorem 4.1: Distance Oracle
proof
Lemma 4.2: Running Time of Init
proof
...and 34 more
