Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

Jack Spalding-Jamieson, Eliot Wong Robson, Da Wei Zheng

TL;DR

This work addresses scalable $k$-means clustering for very large $k$ on massive, high-dimensional datasets by identifying Lloyd's reassignment step as the bottleneck and introducing Seeded Approximate Nearest-Neighbor Search (SANNS) and Seeded Search-Graphs to accelerate it. The authors propose SHEESH, a practical algorithm that uses previous iterations' seeds, bulk seeded queries, and continuous rebuilds of seed-aware search graphs (based on HNSW) to speed up the assignment step without specialized hardware. Empirical results on large image/text embeddings show substantial speedups over GPU Lloyd and competitive performance with other ANNS baselines, validating the practicality of seeded-graph strategies for large-$k$ clustering. The approach offers a scalable, CPU-friendly pathway for high-dimensional, large-scale clustering, with potential extensions to other data regimes and hardware acceleration.
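For orientation, the reassignment step named above as the bottleneck can be sketched as a plain, brute-force Lloyd iteration. This is a minimal illustration in standard-library Python, not the authors' code: the function names `assign`, `update`, and `lloyd` are hypothetical, and the uniformly random initialization is only one possible choice. The point is that `assign` evaluates $n \cdot k$ distances per iteration, which is what becomes prohibitive for large $k$.

```python
import math
import random

def assign(points, centers):
    """Brute-force reassignment: scan all k centers for each of the n points."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def update(points, labels, centers):
    """Move each center to the mean of its points; empty clusters stay put."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += p[t]
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(len(centers))
    ]

def lloyd(points, k, iters=10, seed=0):
    """Alternate assignment and update for a fixed number of iterations."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # uniform random init
    for _ in range(iters):
        labels = assign(points, centers)
        centers = update(points, labels, centers)
    return centers, assign(points, centers)
```

Calling `lloyd(points, k)` on a list of equal-length coordinate tuples returns the final centers and one label per point. Any approximate or seeded search method would replace only the `assign` step; the update step is already linear in the number of points.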

Abstract

For very large values of $k$, we consider methods for fast $k$-means clustering of massive datasets with $10^7\sim10^9$ points in high dimensions ($d\geq100$). All current practical methods for this problem have runtimes of at least $\Omega(k^2)$. We find that initialization routines are not a bottleneck for this case. Instead, it is critical to improve the speed of Lloyd's local-search algorithm, particularly the step that reassigns points to their closest center. Attempting to improve this step naturally leads us to leverage approximate nearest-neighbor search methods, although this alone is not enough to be practical. Instead, we propose a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
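To make the idea of a seeded query concrete, here is a minimal sketch in standard-library Python. It rests on assumptions that are not in the abstract: the seed is taken to be a center index supplied by the caller (for example, the center a point was assigned to in the previous iteration), the graph is a brute-force $m$-nearest-neighbor graph over the centers rather than HNSW, and the search is a simple greedy walk rather than a beam search. The names `knn_graph` and `seeded_greedy_search` are hypothetical.

```python
import math

def knn_graph(centers, m):
    """Hypothetical helper: link each center to its m nearest other centers.

    Built by brute force purely for illustration; a real search graph
    would be constructed far more cheaply.
    """
    graph = []
    for i, c in enumerate(centers):
        others = sorted(
            (j for j in range(len(centers)) if j != i),
            key=lambda j: math.dist(c, centers[j]),
        )
        graph.append(others[:m])
    return graph

def seeded_greedy_search(query, centers, graph, seed):
    """Walk the graph from `seed`, moving to the neighbor closest to `query`
    for as long as that strictly reduces the distance."""
    current = seed
    best = math.dist(query, centers[current])
    while True:
        nxt = min(
            graph[current],
            key=lambda j: math.dist(query, centers[j]),
            default=current,
        )
        d = math.dist(query, centers[nxt])
        if d >= best:  # no neighbor improves: local minimum reached
            return current
        current, best = nxt, d
```

With a good seed the walk stops after very few hops, since the starting center is already close to the query. A greedy walk like this can stop at a local minimum, so the result is approximate in general.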

Paper Structure

This paper contains 37 sections, 2 theorems, 1 equation, 8 figures, 3 tables, 4 algorithms.

Key Result

Theorem 4.1

Let $q$ be a query point whose nearest-neighbor in $P$ is $a$. Let $s\in P$ be a point so that $1+\delta\geq\frac{D(s,q)}{D(a,q)}$, for some value $\delta>0$. Then the beam-search algorithm (alg:beam) starting at $s$ returns a $\left(\frac{\alpha+1}{\alpha-1}+\varepsilon\right)$-approximate nearest-neighbor in $\left\lceil\log

Figures (8)

  • Figure 1: An illustration of the search path formed by the beam-search algorithm (alg:beam) on each "level" of the HNSW data structure. The large green sphere denotes the query point, and the search path is highlighted in beige.
  • Figure 2: Comparisons of different initialization methods for $k$-means with $k=10\,000$.
  • Figure 3: Comparison of HNSW as a black-box method for $k$-means clustering vs Lloyd's algorithm on the DPR5M dataset, with $k=10\,000$. Initialization is uniformly random.
  • Figure 4: Comparison of our approach with GPU acceleration, as well as the black-box HNSW approach on the SIFT20M, Text2Image10M, and DPR5M datasets respectively, for $k=10\,000$. Initialization is uniformly random.
  • Figure 5: A plot of SHEESH running on SIFT1B with $k=1\,000\,000$ for just over $12$ hours. Initialization is uniformly random. We estimate SciKit-Learn would take roughly 9.5 days to run a single iteration in this case.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Theorem 4.1
  • proof
  • Corollary 4.2