TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs
Laxman Dhulipala, Jason Lee, Jakub Łącki, Vahab Mirrokni
TL;DR
The paper addresses scalable high-quality hierarchical agglomerative clustering (HAC) for graphs with trillions of edges by introducing $\mathsf{TeraHAC}$, a distributed $(1+\epsilon)$-approximate HAC that leverages the nearest-neighbor chain paradigm and a new notion of $(1+\epsilon)$-good merges to enable aggressive parallelism. It proves that performing $(1+\epsilon)$-good merges yields a $(1+\epsilon)$-approximate dendrogram, with merge order being flexible enough to allow interleaving and distributed execution. The algorithm partitions the input graph into subgraphs, runs a local $\mathsf{SubgraphHAC}$ on each, and then merges results, augmented by vertex pruning and a dendrogram-flattening step; the key subroutine runs in $O((m+n)\log^2 n)$ time. Empirically, $\mathsf{TeraHAC}$ scales to graphs with up to 8 trillion edges, achieves over 100x fewer rounds than prior methods, and is up to 8.3x faster than $\mathsf{SCC}$ with about 1.16x higher quality, effectively preserving HAC quality while dramatically improving runtime on massive graphs.
Abstract
We introduce TeraHAC, a $(1+\epsilon)$-approximate hierarchical agglomerative clustering (HAC) algorithm which scales to trillion-edge graphs. Our algorithm is based on a new approach to computing $(1+\epsilon)$-approximate HAC, which is a novel combination of the nearest-neighbor chain algorithm and the notion of $(1+\epsilon)$-approximate HAC. Our approach allows us to partition the graph among multiple machines and make significant progress in computing the clustering within each partition before any communication with other partitions is needed. We evaluate TeraHAC on a number of real-world and synthetic graphs of up to 8 trillion edges. We show that TeraHAC requires over 100x fewer rounds compared to previously known approaches for computing HAC. It is up to 8.3x faster than SCC, the state-of-the-art distributed algorithm for hierarchical clustering, while achieving 1.16x higher quality. In fact, TeraHAC essentially retains the quality of the celebrated HAC algorithm while significantly improving the running time.
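
To make the notion of $(1+\epsilon)$-approximate HAC concrete, the following is a minimal sequential sketch in Python. It is not TeraHAC, SubgraphHAC, or the paper's definition of a good merge. It assumes the usual relaxation in which a merge is allowed whenever its similarity is at least $W_{\max}/(1+\epsilon)$, where $W_{\max}$ is the largest similarity currently in the graph, and it assumes average linkage (total edge weight between two clusters divided by the product of their sizes). The function name `approx_hac` and the tie-breaking rule are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]


def approx_hac(n: int, edges: List[Edge], eps: float) -> List[Tuple[int, int, float]]:
    """Sequential (1+eps)-approximate average-linkage HAC on a weighted graph.

    Assumption: a merge is admissible if its similarity is at least
    (current maximum similarity) / (1 + eps). Returns (a, b, similarity)
    triples; each merge creates a new cluster id n, n+1, ...
    """
    size: Dict[int, int] = {v: 1 for v in range(n)}
    # cut[u][v] = total weight of edges between clusters u and v
    cut: Dict[int, Dict[int, float]] = {v: {} for v in range(n)}
    for u, v, w in edges:
        cut[u][v] = cut[u].get(v, 0.0) + w
        cut[v][u] = cut[v].get(u, 0.0) + w

    def sim(a: int, b: int) -> float:
        # Average linkage between two adjacent clusters.
        return cut[a][b] / (size[a] * size[b])

    merges: List[Tuple[int, int, float]] = []
    next_id = n
    while True:
        pairs = [(a, b, sim(a, b)) for a in cut for b in cut[a] if a < b]
        if not pairs:
            break
        w_max = max(s for _, _, s in pairs)
        # Any pair within a (1+eps) factor of the maximum may be merged.
        # Illustrative tie-break: take the admissible pair with smallest ids.
        a, b, s = min(p for p in pairs if p[2] * (1 + eps) >= w_max)

        c = next_id
        next_id += 1
        size[c] = size[a] + size[b]
        cut[c] = {}
        for old in (a, b):
            for nb, w in cut.pop(old).items():
                if nb in (a, b):
                    continue  # the edge between a and b disappears
                cut[c][nb] = cut[c].get(nb, 0.0) + w
                cut[nb].pop(old)
                cut[nb][c] = cut[c][nb]
        merges.append((a, b, s))
    return merges


if __name__ == "__main__":
    path = [(0, 1, 0.9), (1, 2, 1.0), (2, 3, 0.5)]
    print(approx_hac(4, path, 0.0))  # exact HAC: always merges a maximum pair
    print(approx_hac(4, path, 0.2))  # may merge (0, 1) first, since 0.9 >= 1.0 / 1.2
```

With `eps = 0` the sketch reduces to exact HAC and recomputes the global maximum at every step. With `eps > 0` several merges are admissible at once, and that slack in the merge order is what an algorithm can exploit to make progress inside a partition before communicating.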
