A Faster $k$-means++ Algorithm

Jiehao Liang, Somdeb Sarkhel, Zhao Song, Chenbo Yin, Junze Yin, Danyang Zhuo

TL;DR

The paper speeds up the initialization phase of $k$-means clustering by introducing FastKMeans++, which uses a DistanceOracle built on JL-based sketches to approximate distances. This decouples the high-dimensional distance calculations from the core iteration and yields a nearly optimal total running time of $\widetilde{O}(nd + nk^2)$ while preserving a constant-factor approximation to the optimal centers. The authors prove the guarantee $\mathbb{E}[\mathrm{cost}(P, C)] = O(\mathrm{cost}(P, C^*))$, with running time $O(\varepsilon^{-2} n (d + k^2 \log \log k) \log(n/\delta))$ and space $O(n(d+k+\varepsilon^{-2} \log(n/\delta)))$, along with $O(\varepsilon^{-2} n k \log(n/\delta))$ time for the LocalSearch++ component. Experiments on synthetic and real datasets show practical speedups, especially in high dimensions. Overall, the work combines distance sketching with data-structure techniques to make clustering initialization faster in both theory and practice.
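The sketching idea mentioned above can be illustrated with a minimal Python example. It is not the paper's DistanceOracle: it only shows how a Gaussian JL projection from $d$ to $m$ dimensions lets squared distances be estimated in the lower-dimensional space. The function names `make_sketch` and `approx_sq_dists`, the Gaussian choice of sketch, and the sizes `n`, `d`, `m` are all assumptions made for illustration.

```python
import numpy as np

def make_sketch(d, m, rng):
    """Gaussian JL matrix of shape (m, d), scaled so E[||Sx||^2] = ||x||^2.
    (Illustrative choice; the paper's sketch may differ.)"""
    return rng.standard_normal((m, d)) / np.sqrt(m)

def approx_sq_dists(sketched_points, sketched_center):
    """Squared distances from every sketched point to one sketched center.
    Costs O(n * m) rather than O(n * d)."""
    diff = sketched_points - sketched_center
    return np.einsum("ij,ij->i", diff, diff)

rng = np.random.default_rng(0)
n, d, m = 200, 2000, 400            # hypothetical sizes
P = rng.standard_normal((n, d))     # synthetic data, one point per row
S = make_sketch(d, m, rng)
PS = P @ S.T                        # one-time preprocessing: sketch all points

c = 0                               # use point 0 as the query center
exact = np.sum((P - P[c]) ** 2, axis=1)
approx = approx_sq_dists(PS, PS[c])

# Relative error over all points other than the center itself.
mask = exact > 0
max_rel_err = np.max(np.abs(approx[mask] / exact[mask] - 1.0))
print(f"max relative error with m={m}: {max_rel_err:.3f}")
```

After the one-time projection, each distance query touches only $m$ coordinates per point, which is where the savings come from when $d$ is large.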

Abstract

$k$-means++ is an important algorithm for choosing initial cluster centers for the $k$-means clustering algorithm. In this work, we present a new algorithm that solves the $k$-means++ problem in nearly optimal running time. Given $n$ data points in $\mathbb{R}^d$, the current state-of-the-art algorithm runs in $\widetilde{O}(k)$ iterations, and each iteration takes $\widetilde{O}(ndk)$ time, so the overall running time is $\widetilde{O}(ndk^2)$. We propose a new algorithm, FastKMeans++, that takes only $\widetilde{O}(nd + nk^2)$ time in total.
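As a rough worked comparison of the two bounds, ignoring the polylogarithmic factors hidden in $\widetilde{O}(\cdot)$ (which need not be the same in both bounds):

$$\underbrace{\widetilde{O}(k)}_{\text{iterations}} \cdot \underbrace{\widetilde{O}(ndk)}_{\text{per iteration}} = \widetilde{O}(ndk^2), \qquad \frac{ndk^2}{nd + nk^2} = \frac{dk^2}{d + k^2} \ge \frac{1}{2}\min(d, k^2).$$

So, up to logarithmic factors, the improvement grows with the smaller of $d$ and $k^2$. For the illustrative values $d = 1000$ and $k = 10$, the ratio is $\frac{1000 \cdot 100}{1000 + 100} \approx 91$.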

Paper Structure

This paper contains 30 sections, 21 theorems, 39 equations, 11 figures, 1 table, and 4 algorithms.

Key Result

Theorem 1.1

Given a point set $P \subset \mathbb{R}^d$ of $n$ points and $Z = \widetilde{O}(k)$, Algorithm alg:k_means (FastKMeans++) runs in $\widetilde{O}(n(d + k^2))$ time and uses $O(n(d + k))$ space. Let $C$ denote the output of Algorithm alg:k_means and let $C^*$ be the set of optimal centers. Then we have $\mathbb{E}[\mathrm{cost}(P, C)] = O(\mathrm{cost}(P, C^*))$.
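To make the quantity $\mathrm{cost}(P, C)$ concrete, here is a small Python sketch. It assumes that Definition 3.1 is the standard $k$-means objective (the sum, over all points, of the squared Euclidean distance to the nearest center); the definition's exact wording is not reproduced on this page, and the function name `kmeans_cost` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_cost(P, C):
    """Assumed standard k-means cost: sum over points of the squared
    Euclidean distance to the nearest center. P is (n, d), C is (k, d)."""
    # Pairwise squared distances, shape (n, k).
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Toy data: two well-separated pairs of points in the plane.
P = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

C_good = np.array([[0.0, 1.0], [10.0, 1.0]])  # one center per pair
C_bad = np.array([[0.0, 0.0], [0.0, 2.0]])    # both centers in the left pair

print(kmeans_cost(P, C_good))  # 4.0
print(kmeans_cost(P, C_bad))   # 200.0
```

A constant-factor guarantee of the kind stated in the theorem bounds the expected cost of the returned centers by a constant multiple of the optimal cost, which rules out outcomes like `C_bad` in expectation on inputs of this shape.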

Figures (11)

  • Figure 1: The relationship between each parameter and the running time, where the original algorithm denotes the algorithm in [ls19], and our algorithm denotes FastKMeans++ in Theorem \ref{thm:k_means_formal}. Let $n$ be the number of points in the point set. Let $d$ denote the dimension of each point. Let $m$ denote the dimension of each point after it is processed with a sketching matrix. Let $k$ be the number of clusters and centers.
  • Figure 2: The running time of the $k$-means++ and FastKMeans++ algorithms on the SCADI and MUPCI data sets. $m$ denotes the dimension of each point after the transformation, and $k$ denotes the number of clusters. There are two plots for each data set: the left one shows the running time of the two algorithms as a function of $m$, and the right one shows it as a function of $k$.
  • Figure 3: The running time of the $k$-means++ and FastKMeans++ algorithms on the Libras Movement and STDW data sets. $m$ denotes the dimension of each point after the transformation, and $k$ denotes the number of clusters. There are two plots for each data set: the left one shows the running time of the two algorithms as a function of $m$, and the right one shows it as a function of $k$.
  • Figure 4: The relationship between running time of original $k$-means++ algorithm and parameter $n$
  • Figure 5: The relationship between running time of our FastKMeans++ algorithm and parameter $n$
  • ...and 6 more figures

Theorems & Definitions (44)

  • Theorem 1.1: Informal Version of Theorem \ref{thm:k_means_formal}
  • Definition 3.1: Cost
  • Definition 3.2: Mean
  • Lemma 3.3: JL Lemma [jl84]
  • Lemma 3.4
  • Lemma 3.5
  • Theorem 4.1: Distance Oracle
  • proof
  • Lemma 4.2: Running Time of Init
  • proof
  • ...and 34 more