Improved Community Detection using Stochastic Block Models

Minhyuk Park, Daniel Wang Feng, Siya Digra, The-Anh Vu-Le, Lahari Anne, George Chacko, Tandy Warnow

TL;DR

The paper investigates edge connectivity in community detection with Stochastic Block Models (SBMs), revealing that SBMs frequently yield internally disconnected communities on real networks. It introduces Well-Connected Clusters (WCC), a post-processing method that iteratively removes small edge cuts to enforce well-connectedness, and compares it to Connectivity Modifier (CM) and simple Connected Components (CC). Across large-scale synthetic networks (LFR and RECCS) and real networks, SBM+WCC generally improves clustering accuracy (ARI/NMI/AGRI/RMI) while remaining scalable to networks with millions of nodes, whereas CM shows mixed effects and CC can reduce node coverage. The authors further explain why Degree Corrected SBM drives disconnections and why WCC outperforms CM, linking behavior to the description-length objective, and provide an open-source implementation for practical use.

Abstract

Identifying edge-dense communities that are also well-connected is an important aspect of understanding community structure. Prior work has shown that community detection methods can produce poorly connected communities, and some can even produce internally disconnected communities. In this study we evaluate the connectivity of communities obtained using Stochastic Block Models. We find that SBMs produce internally disconnected communities from real-world networks. We present a simple technique, Well-Connected Clusters (WCC), which repeatedly removes small edge cuts until the communities meet a user-specified threshold for well-connectivity. Our study using a large collection of synthetic networks based on clustered real-world networks shows that using WCC as a post-processing tool with SBM community detection typically improves clustering accuracy. WCC is fast enough to use on networks with millions of nodes and is freely available in open source form.
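The loop described above (find a small edge cut, remove it, repeat until every cluster is well-connected) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the adjacency-dictionary input, the default threshold of log10(n), the use of a plain Stoer-Wagner minimum cut, and the choice to drop singleton pieces are all assumptions made for illustration.

```python
import math

def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    remaining, parts = set(nodes), []
    while remaining:
        seed = remaining.pop()
        part, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            for v in adj[u]:
                if v in remaining:
                    remaining.remove(v)
                    part.add(v)
                    frontier.append(v)
        parts.append(part)
    return parts

def min_cut(nodes, adj):
    """Stoer-Wagner minimum edge cut of the connected subgraph induced by `nodes`.
    Returns (cut_size, nodes_on_one_side)."""
    weights = {u: {v: 1 for v in adj[u] if v in nodes and v != u} for u in nodes}
    members = {u: {u} for u in nodes}  # original nodes merged into each supernode
    best_size, best_side = math.inf, set()
    while len(weights) > 1:
        # One phase: repeatedly add the node most tightly attached to the grown set.
        start = next(iter(weights))
        order = [start]
        attach = {v: weights[start].get(v, 0) for v in weights if v != start}
        while attach:
            nxt = max(attach, key=attach.get)
            last_weight = attach.pop(nxt)
            order.append(nxt)
            for v, w in weights[nxt].items():
                if v in attach:
                    attach[v] += w
        s, t = order[-2], order[-1]
        # The cut separating t's merged group from the rest has size last_weight.
        if last_weight < best_size:
            best_size, best_side = last_weight, set(members[t])
        # Merge t into s before the next phase.
        members[s] |= members.pop(t)
        for v, w in weights.pop(t).items():
            del weights[v][t]
            if v != s:
                weights[s][v] = weights[s].get(v, 0) + w
                weights[v][s] = weights[s][v]
    return best_size, best_side

def wcc_sketch(adj, clusters, threshold=lambda n: math.log10(n)):
    """Split clusters along small edge cuts until each piece is well-connected.
    Assumption: a cluster of n nodes is well-connected if its min cut exceeds threshold(n)."""
    done, stack = [], [set(c) for c in clusters]
    while stack:
        nodes = stack.pop()
        parts = components(nodes, adj)
        if len(parts) > 1:      # disconnected cluster: handle each component separately
            stack.extend(parts)
            continue
        if len(nodes) < 2:      # assumption: singleton pieces are dropped
            continue
        size, side = min_cut(nodes, adj)
        if size > threshold(len(nodes)):
            done.append(nodes)
        else:                   # removing the cut edges leaves two smaller pieces
            stack.extend([side, nodes - side])
    return done

# Toy input: two 5-cliques joined by a single bridge edge, given as one cluster.
adj = {u: set() for u in range(10)}
for block in (range(0, 5), range(5, 10)):
    for u in block:
        adj[u].update(v for v in block if v != u)
adj[4].add(5)
adj[5].add(4)
result = sorted(sorted(c) for c in wcc_sketch(adj, [range(10)]))
print(result)
```

On this toy input the bridge is a cut of size 1, which does not exceed log10(10) = 1, so the cluster is split into the two cliques. Each clique has a minimum cut of 4 and is kept. The cubic-time minimum cut used here is only suitable for small examples.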

Paper Structure

This paper contains 7 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Experiment 1: Cluster connectivity of SBM on real-world networks. (A) 85 non-bipartite graphs, (B) 35 bipartite graphs. SBM often produces poorly connected and disconnected clusters, with a greater tendency to do so on bipartite graphs. The figure shows the proportion of well-connected (green), poorly connected (blue), and disconnected (orange) clusters in the output clusterings of the lowest description length SBM. The x-axis shows the different real-world networks ordered by number of nodes. White bars separate small, medium, and large networks.
  • Figure 2: Experiment 1: Node coverage of SBM+CC clusterings. (A) Node coverage on 85 non-bipartite networks, (B) Node coverage on 35 bipartite networks. The node coverage of SBM treated with CC is much higher on non-bipartite graphs than on bipartite graphs.
  • Figure 3: Experiment 3a: Impact of treatment on NMI/ARI/AGRI/RMI scores of selected SBM on LFR and RECCS networks (heatmap). WCC outperforms the CC and CM treatments in improving the clustering accuracy of SBM. Each subplot shows the results for one synthetic network (either LFR or RECCS), defined by the real-world network (vertical axis) and clustering (horizontal axis). Gray boxes with "to" indicate time-outs and those with "X" indicate OOM (out-of-memory) errors. Gray boxes without text (N/A in legend) are for networks that are not available; see text for explanation.
  • Figure 4: Experiment 3a: Impact of WCC treatment on ARI scores of selected SBM (bar chart). WCC treatment of SBM clusterings benefits accuracy. The subplots marked with "N/A" indicate networks that are not available; see text for explanation. The subplot marked with "time-out" indicates that WCC failed to complete within 72 hours. The bottom left subplot appears empty because both SBM and SBM+WCC yield an ARI accuracy of $0.0$.
  • Figure 5: Experiment 3b: ARI accuracies of SBM+WCC against various methods and their treatments on LFR and RECCS synthetic networks. SBM+WCC is generally competitive with Leiden-CPM(0.001) and Leiden-mod. Each subplot shows results for one synthetic network (either LFR or RECCS), defined by the real-world network (vertical axis) and clustering (horizontal axis). The "N/A" cells are those networks that are not available; see text for explanation. There are three out-of-memory (oom) entries (top row) and one "time-out" entry (bottom row).
  • ...and 3 more figures