Large Scale Community-Aware Network Generation

Vikram Ramavarapu, João Alfredo Cardoso Lamy, Mohammad Dindoost, David A. Bader

TL;DR

This work addresses the difficulty of evaluating community detection when ground truth is scarce by introducing RECCS+ and RECCS++, scalable versions of the RECCS synthetic network generator, which preserves properties of an input clustering. The authors implement a parallel, multi-process pipeline that splits the network into clustered and singleton subnetworks, fits stochastic block models (SBMs) to both, and applies the RECCS module to generate synthetic graphs, achieving speedups of up to 139× and scaling to networks with over 100 million nodes and nearly 2 billion edges. RECCS+ matches the original RECCS almost exactly on many statistics, while RECCS++ trades some accuracy for substantial further speed gains, and large-scale networks complete where the original could not. Reclustering evaluations show that RECCS+ and RECCS++ produce results similar to each other but can differ from RECCS depending on the clustering method, underscoring the value of scalable synthetic benchmarks for robustness analyses and for future encoder-decoder extensions in network generation.

Abstract

Community detection, or network clustering, is used to identify latent community structure in networks. Due to the scarcity of labeled ground truth in real-world networks, evaluating these algorithms poses significant challenges. To address this, researchers use synthetic network generators that produce networks with ground-truth community labels. RECCS is one such algorithm that takes a network and its clustering as input and generates a synthetic network through a modular pipeline. Each generated ground truth cluster preserves key characteristics of the corresponding input cluster, including connectivity, minimum degree, and degree sequence distribution. The output consists of a synthetically generated network, and disjoint ground truth cluster labels for all nodes. In this paper, we present two enhanced versions: RECCS+ and RECCS++. RECCS+ maintains algorithmic fidelity to the original RECCS while introducing parallelization through an orchestrator that coordinates algorithmic components across multiple processes and employs multithreading. RECCS++ builds upon this foundation with additional algorithmic optimizations to achieve further speedup. Our experimental results demonstrate that RECCS+ and RECCS++ achieve speedups of up to 49x and 139x respectively on our benchmark datasets, with RECCS++'s additional performance gains involving a modest accuracy tradeoff. With this newfound performance, RECCS++ can now scale to networks with over 100 million nodes and nearly 2 billion edges.

Paper Structure

This paper contains 16 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: A general network clustering evaluation pipeline. A reference network and its clustering are given as input to a synthetic network generator, which generates multiple synthetic networks with ground-truth communities. A clustering method is then run on the generated networks, and its output is compared against the ground-truth community labels to measure accuracy.
  • Figure 2: Network sizes and number of clusters for the 73 networks in the Netzschleuder set. The top and middle panels show the number of nodes ($n$) and edges ($m$) respectively, while the bottom panel shows the number of clusters ($c$) in each network.
  • Figure 3: Cluster size distribution across all large networks used for scalability experiments. We use the Orkut, CEN, OC, and OCv2 networks for scalability testing. The networks are clustered with Leiden 0.01, both with and without Connectivity Modifier (CM) treatment.
  • Figure 4: The full pipeline of RECCS+ and RECCS++. First, statistics are collected for the input clustering. In the meantime, a splitter outputs the clustered and singleton subnetworks. When the splitter is finished, SBMs are run on both subnetworks in parallel. When the clustered SBM and the stats are both done, the RECCS module is run. When the RECCS module and the singleton SBM are finished, they are merged into a final output graph. All C++ processes are labeled in pink, while all Python processes are blue.
  • Figure 5: Speedup of the SBM stage after optimization. Algorithmic, data structure, and parallelization optimizations were used to speed up SBM generation. New SBM runs utilize 64 threads. Speedup: Cit-HepPh: 9.7x, CEN: 3.9x.
  • ...and 5 more figures
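
The stage dependencies described in the Figure 4 caption can be sketched as a small task graph. This is only an illustrative sketch, not the authors' orchestrator: every function name below (`collect_stats`, `split`, `run_sbm`, `run_reccs_module`, `merge`, `run_pipeline`) is a hypothetical placeholder that returns a label, and a thread pool stands in for the separate C++ and Python processes the caption mentions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real stages. Each returns a label so the
# dependency order can be inspected; none of them does real graph work.
def collect_stats(network, clustering):
    return {"stage": "stats"}

def split(network, clustering):
    # Placeholder for the splitter: yields the clustered and singleton parts.
    return "clustered", "singleton"

def run_sbm(subnetwork):
    return f"sbm({subnetwork})"

def run_reccs_module(clustered_sbm, stats):
    return f"reccs({clustered_sbm})"

def merge(reccs_output, singleton_sbm):
    return f"merge({reccs_output}, {singleton_sbm})"

def run_pipeline(network, clustering):
    # Assumption: threads replace the multi-process orchestration for brevity.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Stats collection and splitting start at the same time.
        stats_future = pool.submit(collect_stats, network, clustering)
        split_future = pool.submit(split, network, clustering)

        # Both SBM runs wait only on the splitter, then run in parallel.
        clustered, singleton = split_future.result()
        clustered_sbm_future = pool.submit(run_sbm, clustered)
        singleton_sbm_future = pool.submit(run_sbm, singleton)

        # The RECCS module needs the clustered SBM and the stats.
        reccs_output = run_reccs_module(
            clustered_sbm_future.result(), stats_future.result()
        )

        # The final merge needs the RECCS output and the singleton SBM.
        return merge(reccs_output, singleton_sbm_future.result())

result = run_pipeline("network", "clustering")
print(result)
```

The point of the structure is that the singleton SBM can keep running while the RECCS module works on the clustered part, so the merge waits only on whichever of the two branches finishes last.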