Efficient Centroid-Linkage Clustering
MohammadHossein Bateni, Laxman Dhulipala, Willem Fletcher, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki
TL;DR
This paper introduces a subquadratic, $c$-approximate algorithm for Centroid-Linkage HAC. It combines a meta-algorithm that runs dynamic approximate nearest-neighbor search (ANNS) over cluster centroids with a dynamic ANNS data structure that remains correct under adaptive updates. The latter is obtained through a black-box reduction from oblivious to adaptive ANNS, which yields a $c$-approximate HAC running in roughly $n^{1+O(1/c^2)}$ time, with proofs of correctness under adaptive updates. Empirically, the approach achieves clustering quality close to that of exact centroid HAC while delivering substantial speedups (up to $36\times$ on a single core and up to $175\times$ on many cores) across standard benchmarks, using a DiskANN-based dynamic ANNS. The work offers practical and theoretical advances for scalable HAC in high dimensions and highlights open challenges in further parallelizing approximate centroid HAC.
Abstract
We give an efficient algorithm for Centroid-Linkage Hierarchical Agglomerative Clustering (HAC), which computes a $c$-approximate clustering in roughly $n^{1+O(1/c^2)}$ time. We obtain our result by combining a new Centroid-Linkage HAC algorithm with a novel fully dynamic data structure for nearest neighbor search which works under adaptive updates. We also evaluate our algorithm empirically. By leveraging a state-of-the-art nearest-neighbor search library, we obtain a fast and accurate Centroid-Linkage HAC algorithm. Compared to an existing state-of-the-art exact baseline, our implementation maintains the clustering quality while delivering up to a $36\times$ speedup due to performing fewer distance comparisons.
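For orientation, the exact Centroid-Linkage HAC that the abstract uses as its baseline can be sketched as follows. This is a minimal, naive illustration rather than the authors' implementation: the function name `centroid_hac` and the merge-record format are assumptions made for this example, and the closest pair is found by brute force over all centroid pairs. The approach described above would instead answer each closest-centroid query with a dynamic approximate nearest-neighbor structure, accepting any pair whose distance is within a factor $c$ of the minimum.

```python
import numpy as np

def centroid_hac(points):
    """Naive exact Centroid-Linkage HAC (illustrative sketch only).

    Returns a list of merges (id_a, id_b, centroid_distance). Input points
    get ids 0..n-1; each merged cluster gets the next unused id.
    """
    pts = np.asarray(points, dtype=float)
    # Active clusters: id -> (centroid, number of points in the cluster).
    clusters = {i: (pts[i], 1) for i in range(len(pts))}
    next_id = len(pts)
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        # Brute-force search for the closest pair of centroids. This is the
        # step that a dynamic ANNS index would replace.
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                d = float(np.linalg.norm(clusters[a][0] - clusters[b][0]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        # The centroid of the merged cluster is the size-weighted mean.
        clusters[next_id] = ((na * ca + nb * cb) / (na + nb), na + nb)
        merges.append((a, b, d))
        next_id += 1
    return merges

merges = centroid_hac([[0, 0], [0, 1], [10, 0], [10, 1]])
```

Each round of this sketch scans all pairs of active centroids, so it performs far more distance comparisons than necessary; reducing that count is the source of the speedup reported above.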
