Improved Community Detection using Stochastic Block Models

Minhyuk Park; Daniel Wang Feng; Siya Digra; The-Anh Vu-Le; George Chacko; Tandy Warnow

TL;DR

This work investigates the tendency of stochastic block models (SBMs) to produce disconnected clusters on large real-world and synthetic networks. It introduces simple post-processing strategies—Connected Components (CC), Well-Connected Clusters (WCC), and the Connectivity Modifier (CM)—to enforce edge-connectivity and improve clustering quality. Across 122 real networks and numerous synthetic LFR benchmarks, CC and especially WCC enhance accuracy metrics such as ARI, AMI, and NMI, while maintaining reasonable coverage; CM is more variable and can hurt performance. The findings provide practical, low-complexity remedies to bolster SBM-based community detection in large-scale graphs, with WCC recommended as the default post-processing step.
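As an illustration of the simplest treatment named above, the following is a minimal sketch of a Connected Components (CC) style post-processing step: each cluster is replaced by the connected components of the subgraph it induces. The function name, the edge-list and `membership` input formats, and the `(cluster, index)` output labels are assumptions made for this example, not the authors' implementation; how components of size one are treated afterwards is left to the caller.

```python
from collections import defaultdict, deque


def connected_components_treatment(edges, membership):
    """Split every cluster into the connected components of its induced subgraph.

    edges: iterable of (u, v) pairs of an undirected graph.
    membership: dict mapping node -> cluster label (hypothetical input format).
    Returns a dict mapping node -> (original label, component index).
    """
    # Keep only edges whose endpoints lie in the same cluster.
    adj = defaultdict(set)
    for u, v in edges:
        if u in membership and v in membership and membership[u] == membership[v]:
            adj[u].add(v)
            adj[v].add(u)

    new_label = {}
    next_index = defaultdict(int)  # per-cluster counter for component ids
    for start in membership:
        if start in new_label:
            continue
        cluster = membership[start]
        label = (cluster, next_index[cluster])
        next_index[cluster] += 1
        # Breadth-first search restricted to intra-cluster edges.
        new_label[start] = label
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in new_label:
                    new_label[v] = label
                    queue.append(v)
    return new_label


# Cluster "a" is disconnected: {1, 2} and {3, 4} share no intra-cluster edge.
example_edges = [(1, 2), (3, 4), (2, 5)]
example_membership = {1: "a", 2: "a", 3: "a", 4: "a", 5: "b"}
treated = connected_components_treatment(example_edges, example_membership)
```

In this toy input, cluster "a" is split into two clusters and cluster "b" is left unchanged, so every output cluster is connected by construction.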

Abstract

Community detection approaches resolve complex networks into smaller groups (communities) that are expected to be relatively edge-dense and well-connected. The stochastic block model (SBM) is one of several approaches used to uncover community structure in graphs. In this study, we demonstrate that SBM software applied to various real-world and synthetic networks produces clusters that range from poorly connected to disconnected. We present simple modifications that improve the connectivity of SBM clusters and show, using simulated networks, that these modifications also improve accuracy.
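To make the connectivity categories concrete, here is a hedged sketch that labels a single cluster as disconnected, poorly connected, or well-connected from the size of its minimum edge cut. The minimum cut is computed with the Stoer-Wagner algorithm in plain Python. The `threshold` parameter, and its default of `math.log10` applied to the cluster size, are assumptions for illustration: this summary does not state the threshold used in the paper. The function names are likewise hypothetical.

```python
import math


def min_cut_size(nodes, edges):
    """Minimum edge cut of the subgraph induced by `nodes` (Stoer-Wagner).

    Returns 0 when the induced subgraph is disconnected.
    """
    node_set = set(nodes)
    # w[u][v] = number of original edges between (merged) super-nodes u and v.
    w = {u: {} for u in node_set}
    seen = set()
    for u, v in edges:
        key = frozenset((u, v))
        if u == v or u not in node_set or v not in node_set or key in seen:
            continue  # drop self-loops, outside edges, and duplicates
        seen.add(key)
        w[u][v] = 1
        w[v][u] = 1

    best = float("inf")
    active = set(node_set)
    while len(active) > 1:
        # One minimum-cut phase: repeatedly add the most tightly connected node.
        start = next(iter(active))
        added = [start]
        conn = {v: w[start].get(v, 0) for v in active if v != start}
        cut_of_phase = 0
        while conn:
            nxt = max(conn, key=conn.get)
            cut_of_phase = conn.pop(nxt)
            added.append(nxt)
            for v, weight in w[nxt].items():
                if v in conn:
                    conn[v] += weight
        best = min(best, cut_of_phase)
        # Merge the last-added node t into the second-to-last node s.
        s, t = added[-2], added[-1]
        for v, weight in w[t].items():
            del w[v][t]
            if v != s:
                w[s][v] = w[s].get(v, 0) + weight
                w[v][s] = w[v].get(s, 0) + weight
        del w[t]
        active.remove(t)
    return best


def classify_cluster(nodes, edges, threshold=math.log10):
    """Label a non-singleton cluster by its minimum edge cut.

    `threshold` maps cluster size to the cut size that must be exceeded
    for the cluster to count as well-connected (an assumed criterion).
    """
    nodes = set(nodes)
    if len(nodes) < 2:
        raise ValueError("expects a non-singleton cluster")
    cut = min_cut_size(nodes, edges)
    if cut == 0:
        return "disconnected"
    return "well-connected" if cut > threshold(len(nodes)) else "poorly connected"


triangle = [(0, 1), (1, 2), (0, 2)]
print(classify_cluster([0, 1, 2], triangle))                # well-connected
print(classify_cluster([0, 1, 2, 3], [(0, 1), (2, 3)]))     # disconnected
```

Passing the threshold as a function keeps the criterion adjustable, so the same code can be rerun under whatever definition of well-connectedness applies.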

Paper Structure (18 sections, 1 equation, 4 figures, 2 tables)

Introduction
Materials and Methods
Networks
Real-world networks
Synthetic networks
Stochastic Block Models
Post-processing treatments to improve connectivity
Evaluation
Performance Study and Results
Experiments
Experiment 1: Connectivity of SBMs
Experiment 2: Impact of treatments on real-world networks
Experiment 3: Impact of Treatment on Synthetic Networks
Discussion
Summary of trends
...and 3 more sections

Figures (4)

Figure 1: Experiment 1: Cluster Connectivity of SBM on 120 Real-World Networks. The percentages of disconnected, poorly connected, and well-connected clusters are shown for the selected SBM clustering of each of 120 real-world networks. Each colored bar represents a single network; white bars separate the networks into small, medium, and large groups. Two of the initial 122 datasets are not represented here because the selected SBM model returned no non-singleton clusters for them.
Figure 2: Experiment 2: Impact of Treatment on Cluster Sizes of Medium and Large Real-World Networks. The distribution of non-singleton cluster sizes is shown as a boxplot for the selected SBM and its treatments. The y-axis is plotted on a log scale, with the whiskers indicating the minimum and maximum cluster sizes across all networks in the group. In both groups, the minimum cluster size is 2 for every SBM clustering, treated or not, but the medians and maxima differ as follows. Medium group median/max: SBM: 103/403801, SBM+CC: 2/38539, SBM+WCC: 2/2966, SBM+CM: 6/2169. Large group median/max: SBM: 507/777770, SBM+CC: 3/337018, SBM+WCC: 3/4387, SBM+CM: 9/3258.
Figure 3: Experiment 3: Impact of Treatment on ARI Scores of Selected SBM (heatmap). Each LFR network is based on a Leiden clustering of a real-world network, with the column indicating the real-world network and the row specifying the optimization problem (either modularity or CPM for a given resolution value). Blue indicates that post-processing with the corresponding treatment improves the ARI accuracy of the clustering method, orange and red indicate that the treatment hurts ARI accuracy, and yellow indicates a neutral impact. We use "n.a." to indicate that a network was not used, either because it had too many disconnected ground-truth clusters or because the LFR software failed to generate it, and "t.o." to indicate that WCC failed to complete within 72 hours.
Figure 4: Experiment 3: Impact of WCC Treatment on ARI Scores of Selected SBM (bar chart). Some LFR networks had too many disconnected ground-truth clusters or failed to generate; results on these networks are not provided and are marked as "n.a.". WCC on the LFR network for cit_patents with Leiden optimizing CPM under $r=0.5$ timed out after 72 hours and is marked as "t.o.". On the LFR network for cit_hepph with $r=0.5$, both the selected SBM model and its follow-up WCC yielded an ARI of 0.0.
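The ARI comparisons in Figures 3 and 4 rest on the adjusted Rand index between a ground-truth partition and an estimated one. The sketch below computes the standard ARI from pair counts in plain Python and then takes the difference between a treated and an untreated clustering, which is the quantity a "treatment helps or hurts" comparison would look at. The function name and the toy labels are assumptions for illustration, and the way nodes left unclustered by a treatment are labeled here (each as its own singleton) is also an assumption rather than something this summary specifies.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Pairs placed together in both partitions, in truth only, in prediction only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = in_true * in_pred / total_pairs
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are all singletons
    return (both - expected) / (maximum - expected)


# Toy example: the untreated clustering merges the two true communities,
# the treated clustering separates them but leaves one node as a singleton.
truth = [0, 0, 0, 1, 1, 1]
untreated = ["x", "x", "x", "x", "x", "x"]
treated = ["x0", "x0", "x0", "x1", "x1", "s5"]

ari_untreated = adjusted_rand_index(truth, untreated)
ari_treated = adjusted_rand_index(truth, treated)
delta = ari_treated - ari_untreated  # positive means the treatment helped
```

A positive `delta` corresponds to the treatment improving ARI on that network, and a negative one to the treatment hurting it.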
