TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

Laxman Dhulipala; Jason Lee; Jakub Łącki; Vahab Mirrokni

TL;DR

The paper addresses scalable, high-quality hierarchical agglomerative clustering (HAC) for graphs with trillions of edges by introducing $\mathsf{TeraHAC}$, a distributed $(1+\epsilon)$-approximate HAC algorithm that leverages the nearest-neighbor chain paradigm and a new notion of $(1+\epsilon)$-good merges to enable aggressive parallelism. It proves that performing only $(1+\epsilon)$-good merges yields a $(1+\epsilon)$-approximate dendrogram, and that the merge order is flexible enough to allow interleaving and distributed execution. The algorithm partitions the input graph into subgraphs, runs a local $\mathsf{SubgraphHAC}$ on each, and then merges the results, augmented by vertex pruning and a dendrogram-flattening step; the key subroutine runs in $O((m+n)\log^2 n)$ time. Empirically, $\mathsf{TeraHAC}$ scales to graphs with up to 8 trillion edges, requires over 100x fewer rounds than prior methods, and is up to 8.3x faster than $\mathsf{SCC}$ while achieving about 1.16x higher quality, effectively preserving HAC quality while dramatically improving running time on massive graphs.

Abstract

We introduce TeraHAC, a $(1+\epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing $(1+\epsilon)$-approximate HAC, which is a novel combination of the nearest-neighbor chain algorithm and the notion of $(1+\epsilon)$-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds compared to previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.
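To make the notion of $(1+\epsilon)$-approximate HAC concrete, the following is a minimal sequential sketch, not the distributed TeraHAC algorithm. It assumes the commonly used formulation in which a merge is allowed whenever its similarity is at least $w_{\max}/(1+\epsilon)$, where $w_{\max}$ is the largest similarity currently in the graph, and it assumes average linkage on graphs (total edge weight between two clusters divided by the product of their sizes). The function name `approx_hac` and the edge-list input format are hypothetical choices made for illustration.

```python
def approx_hac(n, edges, eps=0.1):
    """Sequential (1+eps)-approximate HAC sketch (assumed average linkage).

    n     -- number of vertices, labelled 0..n-1
    edges -- list of (u, v, weight) similarity edges
    Returns a list of merges (u, v, similarity); merged clusters get new ids n, n+1, ...
    """
    size = {v: 1 for v in range(n)}
    # adj[u][v] = total weight of original edges between clusters u and v
    adj = {v: {} for v in range(n)}
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    merges = []
    next_id = n
    while True:
        # Average-linkage similarity of every remaining cluster pair with an edge.
        cand = [(adj[u][v] / (size[u] * size[v]), u, v)
                for u in adj for v in adj[u] if u < v]
        if not cand:
            break
        w_max = max(c[0] for c in cand)
        # Any edge within a (1+eps) factor of the best is an allowed merge;
        # this sketch simply takes the first such edge.
        w, u, v = next(c for c in cand if c[0] * (1 + eps) >= w_max)

        # Contract u and v into a new cluster, summing parallel edge weights.
        combined = {}
        for old in (u, v):
            for x, wx in adj[old].items():
                if x != u and x != v:
                    combined[x] = combined.get(x, 0.0) + wx
        for x, wx in combined.items():
            adj[x].pop(u, None)
            adj[x].pop(v, None)
            adj[x][next_id] = wx
        size[next_id] = size[u] + size[v]
        del adj[u], adj[v]
        adj[next_id] = combined
        merges.append((u, v, w))
        next_id += 1
    return merges
```

With `eps=0` this reduces to exact greedy HAC under the assumed linkage. Larger values of `eps` widen the set of allowed merges at each step, which is the flexibility that approximate algorithms exploit for parallelism.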

Paper Structure (7 sections, 10 theorems, 1 equation, 4 figures, 3 algorithms)

Introduction
Our Contribution
Further Related Work
Preliminaries
Approximate Nearest-Neighbor Chain Algorithm
$\mathsf{TeraHAC}$ algorithm
Flattening the Dendrogram

Key Result

Lemma 1

Let $G_1, \ldots, G_n$ be a sequence of graphs, in which each graph is obtained from the previous one by performing an arbitrary merge. Let $v$ be a vertex (cluster) which exists in $G_l, \ldots, G_r$. Let $w_{\max}^i(v)$ be the value of $w_{\max}(v)$ in $G_i$, where $l \leq i \leq r$. Then $w_{\max

Figures (4)

Figure 1: Comparison of the merges available at the start of the algorithm for different parallel HAC algorithms.
Figure 2: Number of rounds used by $\mathsf{TeraHAC}$ compared with $\mathsf{OptimizedRAC}$ ($\mathsf{TeraHAC}$ using $\epsilon=0$), $\mathsf{ParHAC}$ and $\mathsf{RAC}$ on four large real-world graph datasets. All algorithms use a weight threshold of $t=0.01$ (see the experiments section).
Figure 3: Distributed running times of $\mathsf{TeraHAC}$ compared with $\mathsf{OptimizedRAC}$ ($\mathsf{TeraHAC}$ using $\epsilon=0$) on the same graphs and threshold as Figure 2.
Figure 4: Example showing the need for $M(\cdot)$ values in Definition 2. Green edges correspond to merges which are $(1+\epsilon)$-good, and red edges correspond to merges which are not $(1+\epsilon)$-good. After merging $ab$ (which is $(1+\epsilon)$-good) we obtain a vertex $\{a, b\}$ such that $M(\{a, b\}) = 1$. Therefore, the merge of $\{a, b\}$ with $c$ in the resulting graph is not $(1+\epsilon)$-good, since $\max(1+\epsilon, (1+\epsilon)^2) / \min(1, \infty, 1+\epsilon) = (1+\epsilon)^2 > 1+\epsilon$. Hence, the algorithm is forced to merge $c$ with $d$. It is easy to see that allowing a merge of $\{a, b\}$ with $c$ would create a dendrogram that is not $(1+\epsilon)$-approximate.
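The arithmetic in the Figure 4 caption can be checked with a short sketch. The test below is reconstructed from the caption alone and is an assumption, not a quotation of Definition 2: a merge of $u$ and $v$ is treated as $(1+\epsilon)$-good when $\max(w_{\max}(u), w_{\max}(v)) / \min(M(u), M(v), w(uv)) \leq 1+\epsilon$, with $M(\cdot) = \infty$ for a vertex that has not yet been merged. The function name `is_good_merge` and the value $\epsilon = 0.1$ are hypothetical, and the numbers used for the $c$-$d$ merge are inferred from the caption rather than stated in it.

```python
import math


def is_good_merge(w_uv, wmax_u, wmax_v, m_u, m_v, eps):
    """Goodness test as reconstructed from the Figure 4 caption (an assumption).

    w_uv           -- similarity of the edge being merged
    wmax_u, wmax_v -- largest similarity incident to u and to v
    m_u, m_v       -- M(.) values of u and v (math.inf for unmerged vertices)
    """
    return max(wmax_u, wmax_v) / min(m_u, m_v, w_uv) <= 1 + eps


eps = 0.1

# Merge of {a, b} with c, using the values from the caption:
# numerator max(1+eps, (1+eps)^2), denominator min(1, inf, 1+eps).
ab_with_c = is_good_merge(
    w_uv=1 + eps,
    wmax_u=1 + eps,
    wmax_v=(1 + eps) ** 2,
    m_u=1.0,
    m_v=math.inf,
    eps=eps,
)

# Merge of c with d, assuming the c-d edge has similarity (1+eps)^2 and is the
# heaviest edge at both endpoints (inferred from the caption, not stated in it).
c_with_d = is_good_merge(
    w_uv=(1 + eps) ** 2,
    wmax_u=(1 + eps) ** 2,
    wmax_v=(1 + eps) ** 2,
    m_u=math.inf,
    m_v=math.inf,
    eps=eps,
)

print(ab_with_c, c_with_d)
```

Under these assumptions the first ratio is $(1+\epsilon)^2$, which exceeds $1+\epsilon$, while the second ratio is exactly 1, which agrees with the caption's conclusion that the algorithm must merge $c$ with $d$.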

Theorems & Definitions (13)

Definition 1 [benzecri1982construction]
Definition 2: Good merge
Lemma 1
Lemma 2
Definition 3
Lemma 3
Lemma 4
Theorem 1
Lemma 5
Lemma 6
...and 3 more
