FastEnsemble: scalable ensemble clustering on large networks

Yasamin Tabatabaee, Eleanor Wedell, Minhyuk Park, Tandy Warnow

TL;DR

FastEnsemble addresses the variability of stochastic and parameter-sensitive clustering by forming a consensus from multiple partitions via a co-classification matrix. It extends consensus clustering to both modularity and Constant Potts Model (CPM) objectives, uses a scalable workflow that prunes weakly supported edges and then reclusters, and supports multi-method ensembles through weighted combinations. Across extensive synthetic experiments, FastEnsemble generally matches or exceeds the accuracy of ECG and FastConsensus and scales to networks with millions of nodes, though performance varies with mixing, density, and the chosen objective. The study also demonstrates that consensus clustering can mitigate the resolution limit for both modularity and CPM, and that Strict Consensus offers strong performance under extreme partitioning scenarios. These results suggest that FastEnsemble is a practical, scalable tool for robust community detection on very large networks.
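The consensus workflow summarized above (score each edge by how often its endpoints are co-clustered, prune weakly supported edges, recluster) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `edge_support`, `prune`, and `components` are hypothetical, the input partitions are hand-written rather than produced by Leiden, the rule "keep an edge when its support is at least `t`" is an assumed reading of the threshold, and the final reclustering step is replaced by plain connected components to keep the example dependency-free.

```python
from collections import defaultdict


def edge_support(edges, partitions):
    """For each edge, the fraction of partitions that co-cluster its endpoints."""
    support = {}
    for u, v in edges:
        agree = sum(1 for labels in partitions if labels[u] == labels[v])
        support[(u, v)] = agree / len(partitions)
    return support


def prune(support, t):
    """Keep only edges whose support is at least the threshold t (assumed rule)."""
    return {edge: w for edge, w in support.items() if w >= t}


def components(nodes, edges):
    """Stand-in for the reclustering step: connected components via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


# Toy network: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

# Three hypothetical input partitions (node -> cluster label);
# the third one misplaces node 3.
partitions = [
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1},
]

support = edge_support(edges, partitions)
kept = prune(support, t=0.5)
consensus = components(range(6), kept)
print(consensus)  # [[0, 1, 2], [3, 4, 5]]
```

In this toy case the bridge edge is supported by only one of three partitions and is pruned, so the single misplaced node does not affect the consensus. In a real pipeline the surviving edges would keep their support values as weights and be passed to a clustering method rather than to connected components.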

Abstract

Many community detection algorithms are inherently stochastic, leading to variations in their output depending on input parameters and random seeds. This variability makes the results of a single run of these algorithms less reliable. Moreover, different clustering algorithms, optimization criteria (e.g., modularity, the Constant Potts Model), and resolution values can result in substantially different partitions of the same network. Consensus clustering methods, such as ECG and FastConsensus, have been proposed to reduce the instability of non-deterministic algorithms and improve their accuracy by combining a set of partitions resulting from multiple runs of a clustering algorithm. In this work, we introduce FastEnsemble, a new consensus clustering method. Our results on a wide range of synthetic networks show that FastEnsemble produces more accurate clusterings than two other consensus clustering methods, ECG and FastConsensus, for many model conditions. Furthermore, FastEnsemble is fast enough to be used on networks with more than 3 million nodes, and so improves on the speed and scalability of FastConsensus. Finally, we showcase the utility of consensus clustering methods in mitigating the effect of the resolution limit and in clustering networks that are only partially covered by communities.

Paper Structure (8 sections, 3 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Experiment 1a: Setting the default value for $t$ in FastEnsemble. Each plot shows ARI and NMI accuracy for Leiden-mod and FastEnsemble using four different threshold values on the default algorithm design networks with 10,000 nodes. Left: Accuracy as a function of the model mixing parameter (x-axis). Right: Accuracy as a function of the threshold value on the networks with model mixing parameter $0.5$. FE stands for FastEnsemble.
  • Figure 2: Experiment 1b: Evaluating modularity-based consensus clustering pipelines on the algorithm design datasets with 10,000 nodes as a function of the mixing parameter. Results are shown for three consensus clustering methods and for Leiden-mod, with the mixing parameter varying along the x-axis.
  • Figure 3: Experiment 2: Evaluating modularity-based consensus clustering pipelines on synthetic networks based on clustered real-world networks. Results are for modularity-based clustering methods on LFR networks from park2024well-journal, each based on a Leiden-modularity clustering of a real-world network. Left: Accuracy (NMI and ARI). Right: Runtime (in hours). FastConsensus failed to converge on three networks (CEN, open_citations, cit_patents) within the allotted 48 hours.
  • Figure 4: Experiment 3: Comparison of FastEnsemble(Leiden-CPM) and Leiden-CPM on synthetic networks based on clustered real-world networks. The LFR networks are from park2024well-journal, and each is generated from a real-world network clustered using Leiden optimizing CPM for a specific resolution parameter value. The clustering methods studied are Leiden-CPM and FastEnsemble using CPM, each run with the resolution parameter value specified for the given LFR network. Top: Accuracy (NMI and ARI). Bottom: Runtime (in minutes). Results are not shown for three conditions: the two CEN networks, whose LFR graphs have a large fraction of disconnected ground-truth clusters, and the wiki_topcats network, for which the LFR software failed to create a network from the provided parameters.
  • Figure 5: Experiment 4: Accuracy of modularity-based consensus clustering methods on ring-of-cliques networks of varying sizes. Each ring-of-cliques network connects $n$ cliques of size 10 in a ring. The methods compared are Leiden-mod, ECG, FastConsensus, FastEnsemble, and Strict Consensus (with two values of np, the number of partitions). Top left: Accuracy (ARI, NMI, F1-score) as a function of $n$. Top right: Error metrics (FNR and FPR) as a function of $n$. Bottom: Cluster size distribution as a function of $n$ (the dotted line indicates the true distribution).
  • ...and 3 more figures
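
The ring-of-cliques setting in Figure 5 is a standard way to expose the resolution limit of modularity, and a short calculation makes the effect concrete. The sketch below is illustrative only and does not reproduce the paper's experiment: the helper names are hypothetical, the choice of which node in each clique carries the ring edge is an assumption, and modularity is computed at the standard resolution of 1. It compares the partition with one cluster per clique against the partition that merges adjacent cliques in pairs. For cliques of size 10 joined by single edges, these evaluate to $45/46 - 1/n$ and $91/92 - 2/n$ respectively, so the merged partition scores higher once $n > 92$.

```python
from collections import defaultdict


def ring_of_cliques(n, k=10):
    """Edge list for n cliques of size k, each joined to the next by one edge."""
    edges = []
    for c in range(n):
        base = c * k
        # All pairs inside clique c.
        for i in range(k):
            for j in range(i + 1, k):
                edges.append((base + i, base + j))
        # Ring edge: last node of clique c to first node of the next clique
        # (which endpoints are used is an arbitrary choice for this sketch).
        edges.append((base + k - 1, ((c + 1) % n) * k))
    return edges


def modularity(edges, label):
    """Standard modularity: sum over clusters of e_c/m - (d_c / 2m)^2."""
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    degree = defaultdict(int)    # total degree of the cluster
    for u, v in edges:
        degree[label(u)] += 1
        degree[label(v)] += 1
        if label(u) == label(v):
            internal[label(u)] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def compare(n, k=10):
    """Modularity of one-cluster-per-clique vs. adjacent cliques merged in pairs.

    Assumes n is even so that the cliques pair up exactly.
    """
    edges = ring_of_cliques(n, k)
    q_cliques = modularity(edges, lambda v: v // k)
    q_pairs = modularity(edges, lambda v: v // (2 * k))
    return q_cliques, q_pairs


for n in (50, 100):
    q_cliques, q_pairs = compare(n)
    print(n, round(q_cliques, 5), round(q_pairs, 5))
```

With 50 cliques the true partition has the higher modularity, while with 100 cliques the merged partition does, which is why a modularity optimizer can be expected to under-split large rings of small cliques.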