A Faster $k$-means++ Algorithm
Jiehao Liang, Somdeb Sarkhel, Zhao Song, Chenbo Yin, Junze Yin, Danyang Zhuo
TL;DR
The paper speeds up the initialization phase of $k$-means clustering with FastKMeans++, which uses a DistanceOracle built on Johnson-Lindenstrauss (JL) sketches to approximate distances. This decouples the high-dimensional distance calculations from the core iteration, giving a nearly optimal total runtime of $\widetilde{O}(nd + nk^2)$ while preserving a constant-factor approximation to the optimal centers. The authors prove that $\mathbb{E}[\mathrm{cost}(P, C)] = O(\mathrm{cost}(P, C^*))$, with running time $O(\varepsilon^{-2} n (d + k^2 \log \log k) \log(n/\delta))$ and space $O(n(d+k+\varepsilon^{-2} \log(n/\delta)))$, and with $O(\varepsilon^{-2} n k \log(n/\delta))$ time for the LocalSearch++ component. Empirical results on synthetic and real datasets show practical speedups, especially in high dimensions. Overall, the work combines distance sketching with data-structure techniques to make clustering initialization faster both in theory and in practice.
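To make the sketching idea concrete, here is a minimal Python sketch of $D^2$ (k-means++ style) seeding in which every distance is computed on a JL projection rather than on the original $d$-dimensional points. It is an illustration under stated assumptions, not the paper's DistanceOracle or FastKMeans++: the function names `jl_sketch` and `sketched_kmeanspp_seeding`, the Gaussian projection, and the sketch dimension `m` are all hypothetical choices made for this example.

```python
import numpy as np


def jl_sketch(points, m, rng):
    """Project the rows of `points` from d to m dimensions with a Gaussian JL map."""
    d = points.shape[1]
    # Entries are N(0, 1/m), so squared norms are preserved in expectation.
    proj = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    return points @ proj


def sketched_kmeanspp_seeding(points, k, m, seed=0):
    """D^2 sampling where all distances are computed on the m-dimensional sketch.

    Returns the indices of the k chosen centers (hypothetical helper, not the
    paper's algorithm).
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    sketch = jl_sketch(points, m, rng)  # single pass over the full-dimensional data
    centers = [int(rng.integers(n))]    # first center chosen uniformly at random
    # Approximate squared distance from each point to its nearest chosen center.
    d2 = np.sum((sketch - sketch[centers[0]]) ** 2, axis=1)
    for _ in range(1, k):
        total = d2.sum()
        if total == 0.0:
            # Degenerate case: every point coincides with a chosen center.
            idx = int(rng.integers(n))
        else:
            # Sample proportionally to the approximate squared distance.
            idx = int(rng.choice(n, p=d2 / total))
        centers.append(idx)
        # Updating against the new center costs O(n m), independent of d.
        new_d2 = np.sum((sketch - sketch[idx]) ** 2, axis=1)
        d2 = np.minimum(d2, new_d2)
    return np.array(centers)
```

In this simplified version the dimension $d$ is touched only once, when the sketch is built; each of the $k$ sampling rounds then works in $m$ dimensions. The paper's data structure and its guarantees are more refined than this toy loop.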
Abstract
$k$-means++ is an important algorithm for choosing initial cluster centers for the $k$-means clustering algorithm. In this work, we present a new algorithm that solves the $k$-means++ problem in nearly optimal running time. Given $n$ data points in $\mathbb{R}^d$, the current state-of-the-art algorithm runs for $\widetilde{O}(k)$ iterations, each taking $\widetilde{O}(ndk)$ time, for an overall running time of $\widetilde{O}(ndk^2)$. We propose a new algorithm, FastKMeans++, that takes only $\widetilde{O}(nd + nk^2)$ time in total.
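As a back-of-the-envelope check on these bounds (ignoring the logarithmic factors hidden by $\widetilde{O}$, and comparing bounds rather than measured runtimes), the baseline cost is the product of the two stated factors, and its ratio to the new bound is at least half of the smaller of $d$ and $k^2$:

$$\widetilde{O}(k) \cdot \widetilde{O}(ndk) = \widetilde{O}(ndk^2), \qquad \frac{ndk^2}{nd + nk^2} = \frac{dk^2}{d + k^2} \ge \frac{1}{2}\min(d, k^2).$$

The inequality follows from $d + k^2 \le 2\max(d, k^2)$, so the gain in the bound grows with both the dimension and the number of centers.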
