Scalable Community Detection in Massive Networks Using Aggregated Relational Data
Timothy Jones, Owen G. Ward, Yiran Jiang, John Paisley, Tian Zheng
TL;DR
This paper tackles the scalability of community detection under the mixed membership stochastic blockmodel (MMSB) for massive networks by using Aggregated Relational Data (ARD) to form mini-batches from nodal information. It develops ARDMMSB, which uses a Poisson-based ARD likelihood $y_{ik} \sim \text{Poisson}(N_k \boldsymbol{\pi}_i^T B \boldsymbol{\eta}_k)$ and a parallel variational inference scheme with auxiliary variables to estimate $B$, $\boldsymbol{\pi}$, and $\boldsymbol{\eta}$ from aggregated counts. The method demonstrates strong parameter recovery and improved convergence on simulated MMSB data, recovers meaningful, interpretable structure in a large citation network, and outperforms subgraph-based stochastic variational inference (SVI) on several metrics. By using nodal covariates to define subpopulations and aggregating ties between them, ARDMMSB offers scalable, parallelizable inference for massive networks with overlapping communities, with potential extensions to degree correction and broader subpopulation definitions.
Abstract
The mixed membership stochastic blockmodel (MMSB) is a popular Bayesian network model for community detection. Fitting such large Bayesian network models quickly becomes computationally infeasible when the number of nodes grows into the hundreds of thousands or millions. In this paper we propose a novel mini-batch strategy based on aggregated relational data that leverages nodal information to fit the MMSB to massive networks. We describe a scalable inference method that can utilize the nodal information that often accompanies real-world networks. Conditioning on this extra information leads to a model that admits a parallel stochastic variational inference algorithm, utilizing stochastic gradients of the bipartite graph formed from aggregated network ties between node subpopulations. We apply our method to a citation network with over two million nodes and 25 million edges, capturing explainable structure in this network. Our method recovers parameters and achieves better convergence on simulated networks generated according to the MMSB.
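To make the aggregation step concrete, the following is a minimal sketch, not the authors' implementation, of how ties might be aggregated into ARD counts $y_{ik}$ (the number of ties from node $i$ to subpopulation $k$) and how the Poisson rate $N_k \boldsymbol{\pi}_i^T B \boldsymbol{\eta}_k$ from the TL;DR could be evaluated. The function names `aggregate_ties` and `ard_poisson_rate`, the dense adjacency matrix, and the reading of $\boldsymbol{\eta}_k$ as a membership vector for subpopulation $k$ are assumptions made for illustration only.

```python
import numpy as np

def aggregate_ties(adj, groups, n_groups):
    """Return y with y[i, k] = number of ties from node i into subpopulation k.

    adj      : (n, n) 0/1 adjacency matrix (dense here purely for illustration)
    groups   : (n,) integer subpopulation label of each node (from nodal covariates)
    n_groups : number of subpopulations K
    """
    n = adj.shape[0]
    # One-hot indicator matrix: indicator[j, k] = 1 if node j belongs to group k.
    indicator = np.zeros((n, n_groups), dtype=adj.dtype)
    indicator[np.arange(n), groups] = 1
    # Summing each row of adj over the columns in each group gives the ARD counts.
    return adj @ indicator

def ard_poisson_rate(pi, B, eta, group_sizes):
    """Return rate with rate[i, k] = N_k * pi_i^T B eta_k.

    pi          : (n, C) node-level mixed memberships
    B           : (C, C) block connectivity matrix
    eta         : (K, C) subpopulation-level memberships (assumed interpretation)
    group_sizes : (K,) subpopulation sizes N_k
    """
    return (pi @ B @ eta.T) * group_sizes[None, :]

# Toy example: 4 nodes, subpopulations {0, 1} and {2, 3}.
adj = np.array([[0, 1, 1, 1],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [1, 0, 1, 0]])
groups = np.array([0, 0, 1, 1])
y = aggregate_ties(adj, groups, n_groups=2)

# Toy parameters: two communities, hard memberships for simplicity.
pi = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.5, 0.1], [0.1, 0.4]])
eta = np.array([[1.0, 0.0], [0.0, 1.0]])
rate = ard_poisson_rate(pi, B, eta, group_sizes=np.array([2, 3]))
```

Each row of `y` replaces a node's full row of the adjacency matrix with only $K$ counts, which is what makes the aggregated representation small enough to mini-batch. For a network of the size described above, a sparse adjacency matrix would be needed in place of the dense array used here.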
