Large Scale Community-Aware Network Generation
Vikram Ramavarapu, João Alfredo Cardoso Lamy, Mohammad Dindoost, David A. Bader
TL;DR
Evaluating community detection methods is difficult because real-world networks rarely come with ground-truth labels. This work introduces RECCS+ and RECCS++, scalable versions of the RECCS synthetic network generator, which preserves the properties of an input clustering. The authors implement a parallel, multi-process pipeline that splits the network into clustered and singleton parts, fits stochastic block models (SBMs), and applies a RECCS module to generate the synthetic graph, achieving speedups of up to 139× and scaling to networks with over 100 million nodes and nearly 2 billion edges. RECCS+ is nearly identical to the original RECCS on many statistics, while RECCS++ trades some accuracy for substantially more speed; large-scale networks that the original could not process now run to completion. Reclustering evaluations show that RECCS+ and RECCS++ produce results similar to each other but can differ from RECCS depending on the clustering method, underscoring the value of scalable synthetic benchmarks for robustness analyses and for future encoder-decoder extensions in network generation.
Abstract
Community detection, or network clustering, is used to identify latent community structure in networks. Because labeled ground truth is scarce in real-world networks, evaluating these algorithms poses significant challenges. To address this, researchers use synthetic network generators that produce networks with ground-truth community labels. RECCS is one such algorithm: it takes a network and its clustering as input and generates a synthetic network through a modular pipeline. Each generated ground-truth cluster preserves key characteristics of the corresponding input cluster, including connectivity, minimum degree, and degree sequence distribution. The output consists of a synthetically generated network and disjoint ground-truth cluster labels for all nodes. In this paper, we present two enhanced versions: RECCS+ and RECCS++. RECCS+ maintains algorithmic fidelity to the original RECCS while introducing parallelization through an orchestrator that coordinates algorithmic components across multiple processes and employs multithreading. RECCS++ builds upon this foundation with additional algorithmic optimizations to achieve further speedup. Our experimental results demonstrate that RECCS+ and RECCS++ achieve speedups of up to 49× and 139×, respectively, on our benchmark datasets, with the additional performance gains of RECCS++ involving a modest accuracy tradeoff. With this performance, RECCS++ can scale to networks with over 100 million nodes and nearly 2 billion edges.
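To make the inputs and the preserved per-cluster properties concrete, the following is a minimal Python sketch of two steps named above: splitting a network into a clustered part and a singleton part, and computing each cluster's minimum degree and degree sequence. It is not the authors' implementation. The function names `split_network` and `cluster_degree_stats`, the edge-list and label-dictionary input format, and the choice to treat unlabeled nodes and size-one clusters as singletons are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def split_network(edges, labels):
    """Split an edge list into a clustered part and a singleton part.

    Assumption: a node is "clustered" if its cluster has more than one
    member; unlabeled nodes and size-one clusters count as singletons.
    """
    sizes = Counter(labels.values())
    clustered_nodes = {v for v, c in labels.items() if sizes[c] > 1}
    clustered_edges, singleton_edges = [], []
    for u, v in edges:
        if u in clustered_nodes and v in clustered_nodes:
            clustered_edges.append((u, v))
        else:
            # At least one endpoint is a singleton node.
            singleton_edges.append((u, v))
    return clustered_nodes, clustered_edges, singleton_edges


def cluster_degree_stats(clustered_edges, labels, clustered_nodes):
    """Return the minimum intra-cluster degree and the sorted
    intra-cluster degree sequence of every non-singleton cluster."""
    degree = {v: 0 for v in clustered_nodes}
    for u, v in clustered_edges:
        if labels[u] == labels[v]:  # count only edges inside a cluster
            degree[u] += 1
            degree[v] += 1
    per_cluster = defaultdict(list)
    for v, d in degree.items():
        per_cluster[labels[v]].append(d)
    return {
        c: {"min_degree": min(ds), "degree_sequence": sorted(ds, reverse=True)}
        for c, ds in per_cluster.items()
    }


# Hypothetical toy input: a triangle (cluster "a"), a pair (cluster "b"),
# and one singleton node (cluster "c").
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6)]
labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}

clustered_nodes, clustered_edges, singleton_edges = split_network(edges, labels)
stats = cluster_degree_stats(clustered_edges, labels, clustered_nodes)
```

On this toy input, the edge between nodes 5 and 6 falls into the singleton part, and cluster "a" has a minimum intra-cluster degree of 2. A generator in the spirit of RECCS would aim to reproduce such per-cluster statistics in its synthetic output; connectivity, the third property listed above, is not computed in this sketch.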
