When Does Bottom-up Beat Top-down in Hierarchical Community Detection?

Maximilien Dreveton; Daichi Kuroda; Matthias Grossglauser; Patrick Thiran

TL;DR

The paper analyzes hierarchical community detection under the Hierarchical Stochastic Block Model (HSBM), contrasting bottom-up (agglomerative) and top-down (divisive) algorithms. It proves that a bottom-up algorithm using average linkage recovers the latent tree in the sparse regime $N\delta_N = \omega(1)$ and attains exact recovery at intermediate depths up to the information-theoretic threshold, under conditions less restrictive than the existing guarantees for top-down methods. The two-stage algorithm first identifies primitive communities with a Bethe-Hessian spectral initialization and then merges them with a linkage step based on edge density; it is shown to be robust to misclustering, including in bounded-degree settings via graph-splitting. Numerical experiments on synthetic BTSBMs and real networks (e.g., high-school contact networks and power grids) corroborate the theoretical advantages, including fewer dendrogram inversions in bottom-up trees. Together, the results extend the region in which exact recovery at intermediate levels of the hierarchy is achievable.

Abstract

Hierarchical clustering of networks consists of finding a tree of communities, such that lower levels of the hierarchy reveal finer-grained community structures. There are two main classes of algorithms tackling this problem. Divisive (top-down) algorithms recursively partition the nodes into two communities, until a stopping rule indicates that no further split is needed. In contrast, agglomerative (bottom-up) algorithms first identify the smallest community structure and then repeatedly merge the communities using a linkage method. In this article, we establish theoretical guarantees for the recovery of the hierarchical tree and community structure of a Hierarchical Stochastic Block Model by a bottom-up algorithm. We also establish that this bottom-up algorithm attains the information-theoretic threshold for exact recovery at intermediate levels of the hierarchy. Notably, these recovery conditions are less restrictive than those known for top-down algorithms. This shows that bottom-up algorithms extend the feasible region for achieving exact recovery at intermediate levels. Numerical experiments on both synthetic and real data sets confirm the superiority of bottom-up algorithms over top-down algorithms. We also observe that top-down algorithms can produce dendrograms with inversions. These findings contribute to a better understanding of hierarchical clustering techniques and their applications in network analysis.
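
The merge phase of a bottom-up algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the primitive communities are taken as given (the spectral initialization step is omitted), similarity between two clusters is the edge density between them, and the names `edge_density` and `bottom_up_tree` are hypothetical. Because the density is computed over all node pairs of the two clusters, it behaves like a size-weighted average linkage.

```python
import numpy as np

def edge_density(adj, a, b):
    """Fraction of possible edges present between disjoint node lists a and b."""
    return adj[np.ix_(a, b)].sum() / (len(a) * len(b))

def bottom_up_tree(adj, communities):
    """Greedily merge the two clusters with the highest mutual edge density.

    adj: symmetric (n, n) adjacency matrix.
    communities: list of node-index lists (the primitive communities).
    Returns a list of merges (i, j, density); the cluster created by the
    k-th merge gets the id len(communities) + k.
    """
    clusters = {i: list(c) for i, c in enumerate(communities)}
    next_id = len(clusters)
    merges = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Pick the pair of current clusters with the largest edge density.
        density, i, j = max(
            (edge_density(adj, clusters[i], clusters[j]), i, j)
            for pos, i in enumerate(ids)
            for j in ids[pos + 1:]
        )
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, float(density)))
        next_id += 1
    return merges

# Illustrative usage on a small sampled graph with four primitive communities,
# where communities (0, 1) and (2, 3) are siblings in the planted tree.
# The probabilities below are arbitrary values chosen for the demonstration.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
block_probs = np.array([
    [0.9, 0.6, 0.1, 0.1],
    [0.6, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.5],
    [0.1, 0.1, 0.5, 0.9],
])
probs = block_probs[labels][:, labels]
upper = np.triu(rng.random(probs.shape) < probs, 1).astype(int)
adjacency = upper + upper.T  # symmetric, no self-loops
primitive = [list(np.flatnonzero(labels == k)) for k in range(4)]
for i, j, d in bottom_up_tree(adjacency, primitive):
    print(f"merge {i} and {j} (edge density {d:.3f})")
```

The sequence of merges defines the dendrogram. An inversion would show up as a merge whose density is higher than that of an earlier merge on the same branch.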

Paper Structure (45 sections, 10 theorems, 87 equations, 15 figures, 1 algorithm)

Introduction
Notations
Hierarchical Community Detection
Divisive (top-down) Algorithms
Agglomerative (bottom-up) Algorithms
Tree Recovery from the Bottom
Hierarchical Stochastic Block Model
Tree Recovery with Growing Average Degree
Tree Recovery with Bounded Average Degree
Exact Recovery at Intermediate Depths
Chernoff-Hellinger Divergence
Exact recovery at Intermediate Depths
Discussion
Previous Work on Exact Recovery in HSBM
Top-Down HCD and Exact Recovery at Intermediate Depth
...and 30 more sections

Key Result

Theorem 1

Consider an assortative HSBM. Suppose that Assumptions \ref{assumption:fixed_quantites} and \ref{assumption:scalings} hold, with $N \delta_N = \omega( 1 )$. Let $\widehat{\mathcal{C}}$ be an estimator of $\mathcal{C}$, possibly correlated with the graph edges and such that $| \widehat{\mathcal{C}} | = |\mathcal

Figures (15)

Figure 1: Examples of (a) an HSBM and (b) a BTSBM, with the binary string representation of each node. The link probabilities are $p(u)$ for the HSBM and $p_{|u|}$ for the BTSBM. The grey-colored rectangles represent the super-communities.
Figure 2: Performance of bottom-up and top-down algorithms on BTSBMs of depth 3, $N = 3200$ nodes, and interaction probabilities $p_k = a_k \log N / N$, where $a_0 = 40$ and $a_3 = 100$, as a function of $a_1$ and $a_2$. We vary $a_1 \le a_2$ from $a_0$ to $a_3$. The empirical performance of the algorithms is measured by the accuracy at each depth, given by the color scale (results are averaged over 10 realizations). Large circles represent exact recovery (i.e., perfect accuracy on each of the 10 runs), and small crosses represent a non-exact recovery. The colored solid lines delimit the theoretical exact recovery thresholds for each algorithm on the various depths (given by Equations \ref{eq:conditions_intermediate_exact_recovery_top-down}-\ref{eq:conditions_intermediate_exact_recovery_bottom-up}); for a given depth $q$, these equations provide a single condition for bottom-up, but $q$ conditions for top-down. At depths 1 and 2, the regimes where exact recovery can be achieved are the areas above the solid line(s). At depth 3, the area lies below the threshold drawn by the red line (and above the blue and green lines for top-down; this area forms a small triangle).
Figure 3: BTSBMs with depth 3, $N = 1600$, and $\beta = 0.3$. Figures \ref{fig:bounded_confusion1} and \ref{fig:bounded_confusion2} show two confusion matrices when the expected degree equals 5 and 10. Figure \ref{fig:bounded_hzeta} shows the evolution of $\zeta$ as a function of the expected degree. Figure \ref{fig:bounded_treeRecovery} shows the tree recovery success rate with and without graph splitting. Results of Figures \ref{fig:bounded_hzeta} and \ref{fig:bounded_treeRecovery} are averaged over 200 realizations.
Figure 4: Bottom-up and top-down algorithms on the high school data set. Nodes correspond to the students, colors to the true classes, and edges of the graph are in grey. The hierarchical tree is drawn in black, and its root is marked by a star symbol.
Figure 5: Bottom-up algorithm on the power-grid network.
...and 10 more figures

Theorems & Definitions (25)

Definition 1
Theorem 1
Definition 2: Graph-splitting
Theorem 2
Example 1
Example 2
Example 3
Theorem 3
Lemma 1
Proof of Theorem \ref{thm:performance_bottomUp}
...and 15 more
