FastEnsemble: scalable ensemble clustering on large networks
Yasamin Tabatabaee, Eleanor Wedell, Minhyuk Park, Tandy Warnow
TL;DR
FastEnsemble addresses the variability of stochastic and parameter-sensitive clustering by forming a consensus from multiple partitions via a co-classification matrix. It extends consensus clustering to both modularity and Constant Potts Model (CPM) objectives, with a scalable workflow that prunes weakly supported edges and reclusters, and supports multi-method ensembles through weighted combinations. Across extensive synthetic experiments, FastEnsemble generally matches or exceeds the accuracy of ECG and FastConsensus and scales to networks with millions of nodes, though performance varies with mixing, density, and the chosen objective. The study also demonstrates that consensus clustering can mitigate the resolution limit for both modularity and CPM, and that Strict Consensus offers strong performance under extreme partitioning scenarios. These results suggest FastEnsemble as a practical, scalable tool for robust community detection on very large networks.
Abstract
Many community detection algorithms are inherently stochastic, leading to variations in their output depending on input parameters and random seeds. This variability makes the results of a single run of these algorithms less reliable. Moreover, different clustering algorithms, optimization criteria (e.g., modularity, the Constant Potts Model (CPM)), and resolution values can result in substantially different partitions of the same network. Consensus clustering methods, such as ECG and FastConsensus, have been proposed to reduce the instability of non-deterministic algorithms and improve their accuracy by combining a set of partitions resulting from multiple runs of a clustering algorithm. In this work, we introduce FastEnsemble, a new consensus clustering method. Our results on a wide range of synthetic networks show that FastEnsemble produces more accurate clusterings than two other consensus clustering methods, ECG and FastConsensus, for many model conditions. Furthermore, FastEnsemble is fast enough to be used on networks with more than 3 million nodes, and so improves on the speed and scalability of FastConsensus. Finally, we showcase the utility of consensus clustering methods in mitigating the effect of the resolution limit and in clustering networks that are only partially covered by communities.
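
The consensus workflow summarized above (score edges by co-classification, prune weakly supported edges, then recluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the input partitions have already been produced by repeated runs of a stochastic clustering method, and it makes three illustrative choices of its own: the function names `edge_support`, `prune_edges`, and `connected_components` are hypothetical, the threshold of 0.6 is arbitrary, and connected components of the pruned graph stand in for the final reclustering step, which in a real pipeline would be another run of the clustering algorithm on the reweighted network.

```python
from collections import defaultdict


def edge_support(edges, partitions):
    """For each edge, the fraction of partitions that co-cluster its endpoints."""
    support = {}
    for u, v in edges:
        together = sum(1 for p in partitions if p[u] == p[v])
        support[(u, v)] = together / len(partitions)
    return support


def prune_edges(support, threshold):
    """Keep only edges whose support is at least the threshold."""
    return {e: w for e, w in support.items() if w >= threshold}


def connected_components(nodes, edges):
    """Stand-in for reclustering (an assumption of this sketch):
    connected components of the pruned graph, found by depth-first search."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            comp.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        components.append(comp)
    return components


# Toy network: two triangles joined by the bridge edge (2, 3).
nodes = range(6)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

# Four hand-written partitions (node -> cluster label) standing in for
# four runs of a stochastic method; two runs misplace a bridge endpoint.
partitions = [
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1},
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1},
]

support = edge_support(edges, partitions)
kept = prune_edges(support, threshold=0.6)  # illustrative threshold
clusters = connected_components(nodes, kept)
print(support[(2, 3)])                      # bridge edge support
print(sorted(sorted(c) for c in clusters))  # consensus clusters
```

In this toy example the bridge edge is co-clustered in only two of the four runs, so it falls below the threshold and is removed, while every edge inside a triangle is kept. The two triangles are recovered even though two of the individual partitions each misplace one node.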
