Synthetic Networks That Preserve Edge Connectivity

Lahari Anne, The-Anh Vu-Le, Minhyuk Park, Tandy Warnow, George Chacko

TL;DR

This paper tackles the mismatch between synthetic networks generated by Stochastic Block Models (SBMs) and real-world clustered networks, particularly in edge connectivity within clusters. It introduces RECCS, a two-step pipeline that first enhances intra-cluster edge connectivity within an SBM-generated clustered subnetwork and then adds outliers via three strategies before merging into a full network. Across large real-world datasets, RECCS substantially improves alignment with cluster edge-connectivity metrics while preserving other statistics, offering two variant pipelines with differing strengths. The work provides a practical framework for generating more realistic ground-truth networks to evaluate community detection methods and sets the stage for exploring a broader range of clustering techniques on these synthetic networks.

Abstract

Since true communities within real-world networks are rarely known, synthetic networks with planted ground truths are valuable for evaluating the performance of community detection methods. Of the synthetic network generation tools available, Stochastic Block Models (SBMs) produce networks with ground truth clusters that well approximate input parameters from real-world networks and clusterings. However, we show that SBMs can produce disconnected ground truth clusters, even when given parameters from clusterings where all clusters are connected. Here we describe the REalistic Cluster Connectivity Simulator (RECCS), a technique that modifies an SBM synthetic network to improve the fit to a given clustered real-world network with respect to edge connectivity within clusters, while maintaining the good fit with respect to other network and cluster statistics. Using real-world networks up to 13.9 million nodes in size, we show that RECCS, applied to stochastic block models, results in synthetic networks that have a better fit to cluster edge connectivity than unmodified SBMs, while providing roughly the same quality fit for other network and clustering parameters as unmodified SBMs.
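The observation that a synthetic network can contain disconnected ground-truth clusters can be checked directly on any edge list and clustering. The following is a minimal sketch, not the authors' code: the function name `disconnected_fraction` and the input layout (an iterable of undirected edge pairs plus a node-to-cluster dictionary, with outliers simply left out of the dictionary) are assumptions made for illustration.

```python
from collections import defaultdict


def disconnected_fraction(edges, clustering):
    """Return the fraction of clusters whose induced subgraph is disconnected.

    edges: iterable of (u, v) pairs, treated as undirected (hypothetical layout).
    clustering: dict mapping node -> cluster id; unclustered nodes are omitted.
    """
    # Keep only edges whose two endpoints lie in the same cluster.
    adj = defaultdict(set)
    for u, v in edges:
        cu, cv = clustering.get(u), clustering.get(v)
        if cu is not None and cu == cv and u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Group nodes by cluster id.
    members = defaultdict(list)
    for node, cid in clustering.items():
        members[cid].append(node)

    disconnected = 0
    for nodes in members.values():
        # Depth-first search from an arbitrary member, using intra-cluster edges only.
        seen = {nodes[0]}
        stack = [nodes[0]]
        while stack:
            x = stack.pop()
            for y in adj.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        # If some member was never reached, the cluster is disconnected.
        if len(seen) < len(nodes):
            disconnected += 1

    return disconnected / len(members) if members else 0.0
```

Connectivity is only the weakest case of edge connectivity (a cluster is disconnected exactly when its minimum edge cut has size zero), so this check covers the failure mode described above but not the full minimum edge cut statistic.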

Paper Structure (21 sections, 5 figures)

Figures (5)

  • Figure 1: RECCS Workflow. The two phases of the RECCS pipeline, which modifies an initial network with no singleton clusters by adding edges to improve its fit to the input parameters. The first phase uses the input parameters (clustering, degree sequence, number of edges within and between clusters, and edge connectivity for each cluster) and adds edges within clusters of the starting network to achieve the required edge connectivity. The second phase adds edges, potentially between clusters, to improve the fit to the degree sequence. See Section \ref{sec:results} for additional details.
  • Figure 2: Proportion of disconnected clusters in SBM-generated networks. The x-axis shows 110 SBM networks generated using parameters from real-world networks clustered with the Leiden algorithm (training data). Since Leiden clusterings are guaranteed to be connected, this figure shows that the SBM method failed to reproduce the connectivity of the real-world clusterings studied here.
  • Figure 3: Comparing SBM to the RECCS pipelines on the test networks. We compare SBM networks to networks produced using the two pipelines, RECCSv1+Strategy 1 and RECCSv2+Strategy 1, for different network and clustering statistics. The y-axis shows different distance metrics for various network properties. Error is reported using RMSE for degree sequence, outlier degree sequence, and minimum edge cuts sequence; scalar difference is shown for clustering coefficients and mixing parameter; relative difference is shown for the diameter, number of edges between outliers, and number of edges between outliers and clustered nodes. The test set contains six real-world networks, each clustered using Leiden-CPM with $r=0.01$.
  • Figure 4: Accuracy of SBM and Two RECCS pipelines on Test Data, using Three Additional Clusterings. The three additional clusterings are Leiden-CPM with $r=0.1$ (top row), Leiden-modularity (middle row), and the Iterative k-core (IKC) method (bottom row). The y-axis shows different distance metrics for various network properties. Error is reported using RMSE for degree sequence, outlier degree sequence, and minimum edge cuts sequence; scalar difference is shown for clustering coefficients and mixing parameter; relative difference is shown for the diameter, number of edges between outliers, and number of edges between outliers and clustered nodes.
  • Figure 5: Comparing SBM, RECCSv1, and RECCSv2 with respect to the normalized edit distance between synthetic and real-world networks. The normalized edit distance between the edge sets of the true network $G$ and the synthetic network $N$ is $\frac{|E(G) \triangle E(N)|}{|E(G)|}$, where $\triangle$ denotes the symmetric difference, and so the maximum possible value is $2.0$. Each real-world network is clustered using Leiden-CPM with $r=0.01$. Here, RECCSv2+Strategy 1 produces synthetic networks that are closer to the real-world network than RECCSv1+Strategy 1, and about as close as SBM networks.
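
The normalized edit distance in the Figure 5 caption, $\frac{|E(G) \triangle E(N)|}{|E(G)|}$, can be computed with plain set operations. The sketch below is illustrative only and is not taken from the paper's code: the function name `normalized_edit_distance` and the edge-list input format are assumptions, and edges are treated as undirected.

```python
def normalized_edit_distance(edges_g, edges_n):
    """Return |E(G) symmetric-difference E(N)| / |E(G)| for two undirected edge lists.

    edges_g: edges of the real-world network G (hypothetical input format).
    edges_n: edges of the synthetic network N.
    """
    # Canonicalize each undirected edge so that (u, v) and (v, u) compare equal.
    e_g = {frozenset(e) for e in edges_g}
    e_n = {frozenset(e) for e in edges_n}
    if not e_g:
        raise ValueError("the reference network G must have at least one edge")
    # The ^ operator on sets is the symmetric difference.
    return len(e_g ^ e_n) / len(e_g)


# Two networks with the same number of edges and one edge in common.
real = [(1, 2), (2, 3)]
synthetic = [(2, 1), (3, 4)]
print(normalized_edit_distance(real, synthetic))  # 1.0
```

When the two networks have the same number of edges and share none of them, the symmetric difference contains every edge of both networks and the value reaches 2.0; identical edge sets give 0.0.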