It's Hard to HAC with Average Linkage!

MohammadHossein Bateni, Laxman Dhulipala, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki

TL;DR

This work provides a nuanced complexity landscape for average linkage HAC. It proves strong conditional hardness results, including an $\Omega(n^{3/2-\epsilon})$ lower bound under the Combinatorial BMM conjecture and CC-hardness on diameter-4 trees, suggesting limited prospects for near-linear or NC algorithms in general. On structured inputs, it delivers constructive positive results: AL-HAC on paths lies in NC with polylogarithmic depth and near-linear work, and a general upper bound of $O(mh \log n)$ shows efficiency when the output dendrogram height $h$ is small. The paper thus delineates where efficient parallelization is possible (paths, low-height dendrograms) and where it is likely infeasible (general graphs, even simple trees), providing a rigorous foundation for algorithm design in large-scale HAC applications.

Abstract

Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of $n^{3/2-\epsilon}$ on $n$-node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter $4$. On the possibility side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC) on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small.

Paper Structure (20 sections, 21 theorems, 10 equations, 7 figures, 5 algorithms)

Key Result

Theorem 1

If average linkage HAC can be solved by a combinatorial algorithm in $O(n^{3/2-\epsilon})$ time for any $\epsilon > 0$, then the Combinatorial Boolean Matrix Multiplication (Combinatorial BMM) Conjecture is false.
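As background for the conjecture named in the theorem, the Python sketch below shows the standard way triangle detection reduces to a single Boolean matrix product; the figure list describes the paper's reduction as starting from triangle detection. This is not the paper's reduction to HAC, only an illustration of why the two problems are linked. The helper names `boolean_matmul` and `has_triangle` are hypothetical, and the input is assumed to be the symmetric adjacency matrix of a simple graph with no self-loops.

```python
def boolean_matmul(a, b):
    """Boolean product of two n x n 0/1 matrices, computed naively in O(n^3)."""
    n = len(a)
    return [
        [any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def has_triangle(adj):
    """Return True if the simple undirected graph given by adj has a triangle.

    (adj * adj)[i][j] is true exactly when some k is adjacent to both i and j,
    so a triangle exists iff that entry is true for some edge (i, j).
    """
    n = len(adj)
    two_paths = boolean_matmul(adj, adj)
    return any(adj[i][j] and two_paths[i][j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(has_triangle(triangle), has_triangle(path))
```

Any faster Boolean matrix product plugged into `boolean_matmul` speeds up triangle detection by the same factor, which is the direction of the link illustrated here.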

Figures (7)

  • Figure 1: An example of average linkage HAC run on an input graph $G$, with edges labeled by their weights. \ref{sfig:hac1} gives $G$. \ref{sfig:hac2} gives the cluster hierarchy output by HAC. \ref{sfig:hac3} gives the corresponding dendrogram, with each internal node labeled with the weight of its corresponding merge.
  • Figure 2: An example of average linkage HAC run on an input graph $G$, viewed as contracting merged clusters. Each intermediate vertex is labeled with the vertices of $G$ that its cluster contains. Edges are labeled with their weights, and the next edge to be merged is dashed.
  • Figure 3: Our triangle detection reduction, where we compute $G'$ from $G$ by adding $t=5$ nodes over $t$ rounds. \ref{sfig:CBMMRed6} gives $G'$. Each node is labeled according to its round and its corresponding vertex in $G$. Edges are labeled with their weight in the round in which they are added (edges of $G$ have weight $1$). For the $i$th round, $v_i$ and the edges added with weight $1/i-\epsilon$ are highlighted in red.
  • Figure 4: Adaptive Minimum on matrix $A$. The row considered in each step is shown in blue. For the $i$-th row, $k_i$ is written to the right of $A$ in green, with the witnessing entry of $A$ also in green. Indices removed from $I$ in relevant rows are crossed out in red.
  • Figure 5: Reduction from LFM Matching to Adaptive Minimum. \ref{sfig:aMinRed1} gives the LFM Matching instance and \ref{sfig:aMinRed2} its solution. \ref{sfig:aMinRed3} gives the Adaptive Minimum instance from the reduction and \ref{sfig:aMinRed4} its solution.
  • ...and 2 more figures

Theorems & Definitions (32)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Conjecture 1: Combinatorial BMM
  • Theorem 6: Theorem 1.3 of williams2010subcubic
  • Theorem 7
  • Theorem 7
  • Theorem 7
  • ...and 22 more