Efficient Centroid-Linkage Clustering

MohammadHossein Bateni, Laxman Dhulipala, Willem Fletcher, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki

TL;DR

This paper introduces a subquadratic, $c$-approximate algorithm for Centroid-Linkage HAC. It combines a meta-algorithm that runs dynamic approximate nearest-neighbor search (ANNS) over cluster centroids with a dynamic ANNS data structure that is robust to adaptive updates. It develops a black-box reduction from oblivious to adaptive ANNS, which yields an $O(c)$-approximate HAC with runtime near $n^{1+1/c^2}$, together with proofs of correctness under adaptive updates. Empirically, using DiskANN-based dynamic ANNS, the approach achieves clustering quality close to exact centroid HAC on standard benchmarks while delivering substantial speedups (up to $36\times$ on a single core and up to $175\times$ on many cores). The work offers practical and theoretical advances for scalable HAC in high dimensions and highlights further parallelization of approximate centroid HAC as an open challenge.

Abstract

We give an efficient algorithm for Centroid-Linkage Hierarchical Agglomerative Clustering (HAC), which computes a $c$-approximate clustering in roughly $n^{1+O(1/c^2)}$ time. We obtain our result by combining a new Centroid-Linkage HAC algorithm with a novel fully dynamic data structure for nearest neighbor search which works under adaptive updates. We also evaluate our algorithm empirically. By leveraging a state-of-the-art nearest-neighbor search library, we obtain a fast and accurate Centroid-Linkage HAC algorithm. Compared to an existing state-of-the-art exact baseline, our implementation maintains the clustering quality while delivering up to a $36\times$ speedup due to performing fewer distance comparisons.
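
For orientation, the following is a minimal sketch of exact Centroid-Linkage HAC, not the paper's algorithm. It finds the closest pair of centroids by brute force at every step, which is the quadratic-per-merge scan that the abstract's dynamic nearest-neighbor data structure is meant to avoid. The function name `centroid_hac` and the merge-record format are illustrative assumptions. A $c$-approximate variant would presumably be allowed to merge any pair whose centroid distance is within a factor $c$ of the minimum; that reading is also an assumption here.

```python
import math
from itertools import combinations


def centroid_hac(points):
    """Naive exact centroid-linkage HAC (illustrative sketch).

    Returns a list of merges (id_a, id_b, centroid_distance).
    Input points get ids 0..n-1; each merged cluster gets the next free id.
    """
    # Map cluster id -> (centroid, number of points in the cluster).
    clusters = {i: (tuple(p), 1) for i, p in enumerate(points)}
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        # Brute-force scan over all pairs for the closest two centroids.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: math.dist(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        d = math.dist(clusters[a][0], clusters[b][0])
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        # The centroid of the union is the size-weighted mean of the two centroids.
        merged = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters[next_id] = (merged, na + nb)
        merges.append((a, b, d))
        next_id += 1
    return merges


if __name__ == "__main__":
    # Three collinear points: the two close ones merge first.
    print(centroid_hac([(0, 0), (1, 0), (10, 0)]))
```

Storing only a centroid and a size per cluster is enough because centroid linkage never needs the individual points after a merge.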

Paper Structure

This paper contains 23 sections, 9 theorems, 29 equations, 10 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Suppose we are given $\gamma > 1$ and $c,\beta, \Delta$ and $n$ where $\log(\Delta / \beta), \gamma \leq \mathsf{poly}(n)$ and a dynamically updated set $S$ with at most $n$ insertions. Then, if all queries have true distance at most $\Delta$, we can compute a randomized $(O(c), \beta)$-approximate

Figures (10)

  • Figure 1: Three points in $\mathbb{R}^2$, initially all at pairwise distance $1$, showing that Centroid-Linkage HAC merge distances can decrease. \ref{sfig:nonMon2} / \ref{sfig:nonMon3} and \ref{sfig:nonMon4} / \ref{sfig:nonMon5} give the first and second merges, with distances $1$ and $\sqrt{3}/2 < 1$ respectively.
  • Figure 2: Three points in $\mathbb{R}^2$ (two of which are initially at distance $1$) showing that $O(1)$-approximate Centroid-Linkage HAC can arbitrarily reduce merge distances. \ref{sfig:apxMerge2} / \ref{sfig:apxMerge3} and \ref{sfig:apxMerge4} / \ref{sfig:apxMerge5} give the first and second merges, with distances $1$ and $\epsilon \ll 1$ respectively; centroids are dashed circles.
  • Figure 3: Our merge-and-reduce strategy when a point (in green) is inserted.
  • Figure 4: Running times of fastcluster's centroid HAC, our implementation of exact centroid HAC, and the heap-based and bucket-based approximate centroid HAC with $\epsilon=0.1$. In \ref{fig:plot_runtime_numpoints_single_thread}, our approximate and exact implementations are run on 1 core, whereas in \ref{fig:plot_runtime_numpoints} they have access to 192 cores. \ref{fig:plot_runtime_eps} compares the running times of the heap-based and bucket-based algorithms as a function of $\epsilon$.
  • Figure 5: Non-Monotonicity of Centroid HAC: No. of $\delta$-inversions vs $\delta$
  • ...and 5 more figures
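
The two merge distances quoted for Figure 1 can be checked by hand. The coordinates below are an assumed placement of an equilateral triangle with side $1$, chosen for illustration; the figure itself may place the points differently.

$$
p_1 = (0, 0), \qquad p_2 = (1, 0), \qquad p_3 = \left(\tfrac{1}{2}, \tfrac{\sqrt{3}}{2}\right), \qquad \|p_i - p_j\| = 1 \ \text{ for } i \neq j .
$$

All pairs are tied at distance $1$, so the first merge, say of $p_1$ and $p_2$, has distance $1$ and produces the centroid $\mu = \tfrac{1}{2}(p_1 + p_2) = \left(\tfrac{1}{2}, 0\right)$. The second merge then has distance

$$
\|\mu - p_3\| = \sqrt{\left(\tfrac{1}{2} - \tfrac{1}{2}\right)^2 + \left(0 - \tfrac{\sqrt{3}}{2}\right)^2} = \tfrac{\sqrt{3}}{2} < 1 ,
$$

which is smaller than the first merge distance. By symmetry, the same value results whichever tied pair is merged first.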

Theorems & Definitions (17)

  • Definition 1: Centroid
  • Definition 2: Dynamic Approximate Nearest-Neighbor Search (ANNS)
  • Theorem 1: Dynamic ANNS for Oblivious Updates (andoni2009nearest, har2012approximate)
  • Theorem 2: Reduction of Dynamic ANNS from Oblivious to Adaptive
  • Theorem 3: Dynamic ANNS for Adaptive Updates
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 7 more