DynHAC: Fully Dynamic Approximate Hierarchical Agglomerative Clustering

Shangdi Yu, Laxman Dhulipala, Jakub Łącki, Nikos Parotsidis

TL;DR

This work tackles dynamic average-linkage HAC by introducing DynHAC, a fully dynamic algorithm that maintains a $(1+\epsilon)$-approximate dendrogram under point insertions and deletions. It relies on a partitioned, SubgraphHAC-based approach inspired by TeraHAC to update only the affected portions of the clustering while preserving provable approximation guarantees. Empirically, on real-world graphs, DynHAC handles updates up to $423\times$ faster than recomputing from scratch and achieves up to $0.21$ higher NMI than state-of-the-art dynamic hierarchical clustering baselines, which do not guarantee approximation. The code is publicly released.

Abstract

We consider the problem of maintaining a hierarchical agglomerative clustering (HAC) in the dynamic setting, when the input is subject to point insertions and deletions. We introduce DynHAC, the first dynamic HAC algorithm for the popular average-linkage version of the problem that can maintain a (1+ε)-approximate solution. Our approach leverages recent structural results on (1+ε)-approximate HAC to carefully identify the part of the clustering dendrogram that needs to be updated in order to produce a solution that is consistent with what a full recomputation from scratch would have output. We evaluate DynHAC on a number of real-world graphs. We show that DynHAC can handle each update up to 423x faster than recomputing the clustering from scratch. At the same time, it achieves up to 0.21 higher NMI score than state-of-the-art dynamic hierarchical clustering algorithms, which do not provably approximate HAC.

Paper Structure (14 sections, 4 theorems, 1 equation, 13 figures, 3 tables, 5 algorithms)

This paper contains 14 sections, 4 theorems, 1 equation, 13 figures, 3 tables, 5 algorithms.

Key Result

Lemma 2.1

Any dendrogram produced by a sequence of $(1+\epsilon)$-good merges is $(1+\epsilon)$ approximate.
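To make the conclusion of the lemma concrete, here is a minimal Python sketch of what "$(1+\epsilon)$-approximate" usually means for average-linkage HAC on a weighted similarity graph: every merge joins two clusters whose similarity is at least the current best similarity divided by $1+\epsilon$. This is the standard notion from the approximate HAC literature and is assumed here; the paper's definition of a good merge is not reproduced in this summary and is not modeled. The function names (`average_linkage`, `approx_hac`, `check_merges`) and the edge-weight dictionary are illustrative choices, and the quadratic-time search is a from-scratch baseline, not DynHAC.

```python
from itertools import combinations


def average_linkage(a, b, weights):
    """Average-linkage similarity: total cross-cluster edge weight / (|a| * |b|).

    `weights` maps frozenset({u, v}) to a similarity; missing edges count as 0.
    """
    total = sum(weights.get(frozenset((u, v)), 0.0) for u in a for v in b)
    return total / (len(a) * len(b))


def approx_hac(points, weights, eps=0.0):
    """Naive static HAC (assumed baseline, not the paper's algorithm).

    Each step merges the first pair whose similarity is within a (1 + eps)
    factor of the best available similarity. Returns the list of merges.
    """
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        sims = {(a, b): average_linkage(a, b, weights)
                for a, b in combinations(clusters, 2)}
        best = max(sims.values())
        if best <= 0.0:  # remaining clusters are disconnected
            break
        a, b = next(pair for pair, s in sims.items() if s * (1 + eps) >= best)
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def check_merges(points, weights, merges, eps):
    """Replay `merges` and verify each one is within (1 + eps) of the best."""
    clusters = {frozenset([p]) for p in points}
    for a, b in merges:
        if a == b or a not in clusters or b not in clusters:
            raise ValueError("merge refers to a cluster that does not exist")
        best = max(average_linkage(x, y, weights)
                   for x, y in combinations(clusters, 2))
        if average_linkage(a, b, weights) * (1 + eps) < best - 1e-12:
            return False
        clusters -= {a, b}
        clusters.add(a | b)
    return True


# Toy graph (made-up weights): a path a - b - c - d.
points = ["a", "b", "c", "d"]
weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("b", "c")): 0.5,
    frozenset(("c", "d")): 0.9,
}
exact = approx_hac(points, weights, eps=0.0)
print(check_merges(points, weights, exact, eps=0.0))
```

Replaying a merge sequence against a tolerance is a convenient way to sanity-check a dynamic algorithm's output against the static definition on small inputs, although it recomputes all pairwise similarities at every step and does not scale.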

Figures (13)

  • Figure 1: Quality of clustering algorithm.
  • Figure 2: Running times.
  • Figure 5: Update speedup over $\epsilon=0$ and NMI of the last 1% insertions on data sets with different $\epsilon$ values. Deletions are similar.
  • Figure 6: Analysis of $\mathsf{DynHAC}$ on MNIST.
  • Figure 7: Running time and quality on ALOI for static HAC and our $\mathsf{DynHAC}$ insertion and deletion, and GINRCH insertion and deletion.
  • ...and 8 more figures

Theorems & Definitions (8)

  • Definition 1: Partition subgraph
  • Definition 2: Good merge (TeraHAC)
  • Lemma 2.1
  • Definition 3: Partition id
  • Lemma 3.1
  • Definition 4
  • Lemma C.1
  • Lemma D.1