Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search
Jack Spalding-Jamieson, Eliot Wong Robson, Da Wei Zheng
TL;DR
This work addresses scalable $k$-means clustering for very large $k$ on massive, high-dimensional datasets by identifying Lloyd's reassignment as the bottleneck and introducing Seeded Approximate Nearest-Neighbor Search (SANNS) and Seeded Search-Graphs to accelerate it. The authors propose SHEESH, a practical algorithm that uses previous iterations' seeds, bulk seeded queries, and continuous rebuilds of seed-aware search graphs (based on HNSW) to speed up the assignment step without specialized hardware. Empirical results on large image/text embeddings show substantial speedups over GPU Lloyd and competitive performance with other ANNS baselines, validating the practicality of seeded-graph strategies for large-$k$ clustering. The approach offers a scalable, CPU-friendly pathway for high-dimensional, large-scale clustering with potential extensions to other data regimes and hardware accelerations.
Abstract
For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7$ to $10^9$ points in high dimensions ($d \geq 100$). All current practical methods for this problem have runtimes of at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck in this setting. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns each point to its closest center. Improving this step naturally leads us to approximate nearest-neighbor search methods, although these alone are not enough to be practical. We therefore introduce a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
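
To make the reassignment bottleneck and the idea of a seeded query concrete, the following is a minimal Python sketch and not the authors' implementation. It contrasts exact reassignment, which compares every point with every center, with a greedy graph walk that starts each query from a seed center, such as the point's assignment in the previous Lloyd iteration. The function names (`brute_force_assign`, `build_knn_graph`, `seeded_greedy_assign`) and the flat $m$-nearest-neighbor graph over the centers are assumptions made for illustration. The graph is a simple stand-in for the HNSW-based structures mentioned above, and it is built by brute force here only to keep the example short.

```python
import numpy as np

def brute_force_assign(points, centers):
    """Exact reassignment: compare every point with every center, O(n*k*d)."""
    diffs = points[:, None, :] - centers[None, :, :]   # shape (n, k, d)
    return (diffs ** 2).sum(axis=2).argmin(axis=1)

def build_knn_graph(centers, m):
    """Link each center to its m nearest other centers (illustrative stand-in
    for a search graph; requires m <= k - 1)."""
    diffs = centers[:, None, :] - centers[None, :, :]
    d2 = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)           # a center is not its own neighbor
    return np.argsort(d2, axis=1)[:, :m]   # shape (k, m)

def seeded_greedy_assign(points, centers, graph, seeds):
    """Approximate reassignment: for each point, start at its seed center and
    move to a strictly closer graph neighbor until none exists."""
    assign = np.empty(len(points), dtype=np.int64)
    for i, x in enumerate(points):
        cur = int(seeds[i])
        cur_d = float(((centers[cur] - x) ** 2).sum())
        while True:
            nbrs = graph[cur]
            d = ((centers[nbrs] - x) ** 2).sum(axis=1)
            j = int(d.argmin())
            if d[j] < cur_d:               # strict improvement, so the walk terminates
                cur, cur_d = int(nbrs[j]), float(d[j])
            else:
                break
        assign[i] = cur
    return assign
```

In this sketch, a seeded query inspects only the neighbors of the centers it visits, so a good seed means few distance computations per point. The result is a local optimum of the walk and may differ from the exact nearest center, which is why it is only an approximate reassignment.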
