Adaptive cut reveals multiscale complexity in networks

Louis Boucherie, Yong-Yeol Ahn, Sune Lehmann

TL;DR

The paper addresses the limitation of single-level cuts in hierarchical clustering by introducing the adaptive cut, which cuts a dendrogram at multiple levels and optimizes the cut via Markov chain Monte Carlo with simulated annealing. It pairs this approach with a new entropy-based balancedness metric, B, that predicts when multi-level cuts will outperform single-level cuts. Across synthetic and real networks, including an extension of Louvain that produces full dendrograms, the adaptive cut improves partition density and modularity, especially for unbalanced dendrograms, and applies broadly to other clustering tasks. The work provides code, formal definitions, and proofs, offering a robust, adaptable tool for multiscale clustering in networks and beyond.

Abstract

Hierarchical clustering and community detection are important problems in machine learning and complex network analysis. A common approach to identify clusters is to simply cut dendrograms at some threshold. However, single-level cuts are often suboptimal in terms of capturing underlying structure in the data, especially when the dendrogram is unbalanced. In this paper, we present the adaptive cut, a novel method that leverages the hierarchical structure of dendrograms by employing multi-level cuts to overcome the limitations of single-level approaches. The adaptive cut optimizes an objective function using a Markov chain Monte Carlo with simulated annealing, resulting in better partitions. We demonstrate the effectiveness of the adaptive cut through applications to link clustering and modularity optimization, but note that the method is applicable to any clustering task that relies on a dendrogram and an objective function. Beyond the adaptive cut, we introduce the balancedness score, an information-theoretic metric that quantifies how balanced a dendrogram is. Balancedness predicts the potential benefits of using multi-level cuts. For the community detection examples, we evaluate our method on more than 200 real-world networks and multiple synthetic datasets, demonstrating significant improvements in partition density and modularity over traditional single-cut approaches. In addition, we show the generality of the adaptive cut by applying it across various hierarchical clustering techniques and objective functions. Our results indicate that the adaptive cut provides a robust and versatile tool for improving clustering outcomes.
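
To make the idea of a multi-level cut concrete, the following is a minimal sketch, not the authors' implementation. It assumes a cut is a set of dendrogram nodes that together cover every leaf exactly once, and that the annealing proposals either split a cut node into its two children or merge it back into its parent. The toy tree, the edge list, the pair-counting objective, and the names `adaptive_cut_sketch`, `toy_objective`, and `leaves_under` are all hypothetical choices made for illustration; the paper's objectives are partition density and modularity.

```python
import math
import random
from itertools import combinations

# Hypothetical toy dendrogram: internal node -> (left child, right child).
# Any id that is not a key is a leaf.
TREE = {
    "root": ("n1", "n2"),
    "n1": ("a", "b"),
    "n2": ("n3", "f"),
    "n3": ("n4", "e"),
    "n4": ("c", "d"),
}
# Hypothetical graph on the leaves, used only by the toy objective.
EDGES = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("c", "e"), ("d", "e")]}


def leaves_under(tree, node):
    """Return the leaves of the subtree rooted at `node`."""
    if node not in tree:
        return [node]
    left, right = tree[node]
    return leaves_under(tree, left) + leaves_under(tree, right)


def toy_objective(clusters):
    """Assumed stand-in objective: +1 per linked pair, -1 per unlinked pair."""
    total = 0
    for members in clusters:
        for u, v in combinations(members, 2):
            total += 1 if frozenset((u, v)) in EDGES else -1
    return total


def adaptive_cut_sketch(tree, root, objective, steps=2000, t0=1.0,
                        cooling=0.995, seed=0):
    """Simulated annealing over multi-level cuts (illustrative move set)."""
    rng = random.Random(seed)
    parent = {child: p for p, kids in tree.items() for child in kids}

    def score(cut):
        return objective([leaves_under(tree, n) for n in cut])

    cut = {root}  # start with everything in one cluster
    current = score(cut)
    best_cut, best_score = set(cut), current
    temp = t0
    for _ in range(steps):
        node = rng.choice(sorted(cut))
        if node in tree and rng.random() < 0.5:
            # Split: replace an internal cut node by its two children.
            proposal = (cut - {node}) | set(tree[node])
        elif node in parent:
            # Merge: replace every cut node below the parent by the parent.
            p = parent[node]
            covered = set(leaves_under(tree, p))
            proposal = {n for n in cut
                        if not covered.issuperset(leaves_under(tree, n))}
            proposal.add(p)
        else:
            proposal = None  # the root cannot be merged further
        if proposal is not None:
            new = score(proposal)
            delta = new - current
            # Metropolis rule: always accept improvements, sometimes accept losses.
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                cut, current = proposal, new
                if current > best_score:
                    best_cut, best_score = set(cut), current
        temp *= cooling
    return best_cut, best_score


best_cut, best_score = adaptive_cut_sketch(TREE, "root", toy_objective)
print(sorted(best_cut), best_score)
```

In this toy tree the highest-scoring cut selects `n1`, `n3`, and the leaf `f`, which sit at different depths of the dendrogram. Any objective that maps a list of clusters to a number can be passed in place of `toy_objective`.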

Paper Structure

This paper contains 27 sections, 24 equations, 6 figures.

Figures (6)

  • Figure 1: Explanation of the balancedness measure. (a, b, c) Illustrations of different tree structures: (a) an unbalanced caterpillar tree, (b) an intermediate tree, and (c) a balanced tree. (d) Dendrogram of the urban street network of Paris based on link similarities [ahn2010link]. The dendrogram is unbalanced, as shown in (e). (e) The progression of the real, maximal, and minimal entropies (x-axis) across different similarity levels (y-axis). The three entropies are used to compute the balancedness metric (Eq. \ref{eq:balancedness}). (f) Dendrogram of the "Les Miserables" character network based on link similarities [ahn2010link]. The dendrogram is unbalanced, as shown in (g). (g) The progression of the real, maximal, and minimal entropies across different levels. (h) The distribution of balancedness scores for 200 real networks. (i) The balancedness metric plotted against network size (number of nodes), showing that the balancedness score is relatively independent of network size.
  • Figure 2: Comparison between Link Clustering and the adaptive cut. (a) The adjacency matrix of a stochastic block model network, with nodes colored according to edge communities identified by the link clustering method [ahn2010link]. (b) The corresponding dendrogram for the same network, with partitions or communities defined by the single-level link clustering cut [ahn2010link]. The similarity level at which the cut is made is indicated by the dashed line. (c, d) Similar network to (a) and (b), but using the adaptive cut instead of link clustering.
  • Figure 3: Varying Density Stochastic Block Model Network: Comparison of average cluster size and recovery accuracy. (a) Comparison of the average cluster size (log scale) for Link Clustering, the adaptive cut, and the ground truth. Link Clustering produces many singleton communities, which yield a smaller average size, whereas the adaptive cut merges the singletons into clusters, so its average cluster size matches the ground truth better. (b) Adjusted mutual information between the detected partitions and the ground truth assignments. The adaptive cut has higher mutual information with the ground truth than Link Clustering, indicating better community detection.
  • Figure 4: Improvement (in %) of the partition density between the single-level cut (Link Clustering [ahn2010link]) and the multi-level adaptive cut as a function of the dendrogram balancedness. The symbol color indicates the domain of the network, and the symbol size indicates the number of nodes in the network.
  • Figure 5: Improvement (in %) of the modularity between the single-level cut (Louvain) and the multi-level cut (Louvain + adaptive cut) as a function of balancedness. The symbol color indicates the domain of the network, and the symbol size indicates the number of nodes in the network.
  • ...and 1 more figure