Scalable Community Detection in Massive Networks Using Aggregated Relational Data

Timothy Jones, Owen G. Ward, Yiran Jiang, John Paisley, Tian Zheng

TL;DR

This paper tackles the scalability of community detection under the mixed membership stochastic blockmodel (MMSB) for massive networks by using Aggregated Relational Data (ARD) to form mini-batches from nodal information. It develops ARDMMSB, which uses a Poisson-based ARD likelihood $y_{ik} \sim \text{Poisson}(N_k \boldsymbol{\pi}_i^T B \boldsymbol{\eta}_k)$ and a parallel variational inference scheme with auxiliary variables to estimate $B$, $\boldsymbol{\pi}$, and $\boldsymbol{\eta}$ from aggregated counts. The method shows strong parameter recovery and improved convergence on simulated MMSB data, and it finds meaningful, interpretable structure in a large citation network, outperforming subgraph-based SVI on several metrics. By using nodal covariates to define subpopulations and aggregating ties between them, ARDMMSB offers scalable, parallelizable inference for massive networks with overlapping communities, with potential extensions to degree correction and broader subpopulation definitions.
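To make the Poisson ARD likelihood above concrete, the following is a minimal NumPy sketch that computes the rates $N_k \boldsymbol{\pi}_i^T B \boldsymbol{\eta}_k$ for every node-subpopulation pair and draws counts $y_{ik}$ from them. It assumes, since the summary does not spell this out, that $\boldsymbol{\pi}_i$ and $\boldsymbol{\eta}_k$ are membership vectors on the simplex and that $N_k$ is the size of subpopulation $k$. The function name `ard_poisson_rates`, the dimension names `n`, `S`, `K`, and all numeric values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def ard_poisson_rates(pi, B, eta, N):
    """Return the (n, S) matrix of Poisson rates N_k * pi_i^T B eta_k.

    pi:  (n, K) node membership vectors (assumed to lie on the simplex)
    B:   (K, K) block matrix of between-community connection probabilities
    eta: (S, K) subpopulation membership vectors (assumed on the simplex)
    N:   (S,)   subpopulation sizes
    """
    # (n, K) @ (K, K) @ (K, S) gives pi_i^T B eta_k for all pairs (i, k);
    # broadcasting N across rows scales column k by N_k.
    return (pi @ B @ eta.T) * N[np.newaxis, :]


# Illustrative values only: 5 nodes, 3 subpopulations, 2 communities.
rng = np.random.default_rng(0)
n, S, K = 5, 3, 2
pi = rng.dirichlet(np.ones(K), size=n)
eta = rng.dirichlet(np.ones(K), size=S)
B = np.array([[0.30, 0.02],
              [0.02, 0.25]])
N = np.array([100, 250, 50])

rates = ard_poisson_rates(pi, B, eta, N)
y = rng.poisson(rates)  # simulated aggregated counts y_ik
```

Each row of `y` holds one node's tie counts to the `S` subpopulations, so a mini-batch of nodes corresponds to a block of rows rather than a subgraph of the adjacency matrix.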

Abstract

The mixed membership stochastic blockmodel (MMSB) is a popular Bayesian network model for community detection. Fitting such large Bayesian network models quickly becomes computationally infeasible when the number of nodes grows into the hundreds of thousands or millions. In this paper we propose a novel mini-batch strategy based on aggregated relational data that leverages nodal information to fit MMSB to massive networks. We describe a scalable inference method that can utilize nodal information that often accompanies real-world networks. Conditioning on this extra information leads to a model that admits a parallel stochastic variational inference algorithm, utilizing stochastic gradients of a bipartite graph formed from aggregated network ties between node subpopulations. We apply our method to a citation network with over two million nodes and 25 million edges, capturing explainable structure in this network. Our method recovers parameters and achieves better convergence on simulated networks generated according to the MMSB.

Paper Structure

This paper contains 26 sections, 17 equations, 13 figures, 2 algorithms.

Figures (13)

  • Figure 1: Data Generating Processes for the MMSB
  • Figure 2: Left: Graphical representation of a two-node segment of the ARD network. The complete model contains $y_{ik}$ for every node-subpopulation pair. Circles denote variables and observed variables are shaded. The plates contain variables to be replicated. Right: Data Generating Process for Aggregated Relational Data for MMSB.
  • Figure 3: Illustration of the inference process over multiple passes. Each tall orange rectangle represents all of the variational parameters for the nodes. The blue blocks represent the subpopulation and blockmatrix parameters, while the orange blocks represent the parameters for each node. In each pass, the orange blocks are broken up into minibatches. Each minibatch is passed along with the current blue parameters and fit by the algorithm. After the pass, the orange blocks are stored and the blue parameters are averaged before being stored.
  • Figure 4: Left: Boxplots of normalized mutual information (NMI) among the subgraphs considered in simulations, using SVI for subgraphs of size $n$ of the underlying adjacency matrix and ARD subgraphs of size $n$. Right: Posterior means and standard errors of estimation of diagonals of blockmatrix for SVI on subgraphs of the adjacency matrix and ARD data of size $n$. The true values of the block matrix are given as dashed horizontal lines.
  • Figure 5: Comparison between ARDMMSB (ARD) and Gopalan et al.'s Stochastic Variational Inference (SVI), showing the average predictive log likelihood with subgraphs of size $n=500$.
  • ...and 8 more figures
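
The pass structure described in the Figure 3 caption can be sketched as follows. This is only a schematic under stated assumptions: `run_pass` and `toy_fit` are hypothetical names, `toy_fit` is a placeholder standing in for the actual variational updates (which the caption does not specify), and the sequential loop stands in for minibatches that could be processed in parallel because each one receives the same current global parameters.

```python
import numpy as np


def run_pass(node_params, global_params, batch_size, fit_batch):
    """One pass: fit each minibatch with the current globals, then average.

    node_params:   (n, d) per-node parameters (the "orange blocks")
    global_params: subpopulation/blockmatrix parameters (the "blue blocks")
    fit_batch:     callable (batch_node_params, global_params) ->
                   (updated_batch_node_params, global_estimate)
    """
    n = node_params.shape[0]
    new_node_params = np.empty_like(node_params)
    global_estimates = []
    for start in range(0, n, batch_size):
        idx = slice(start, start + batch_size)
        # Every minibatch sees the same globals, so the batches are
        # independent of one another within a pass.
        local, glob = fit_batch(node_params[idx], global_params)
        new_node_params[idx] = local        # store the node-level results
        global_estimates.append(glob)
    # Average the per-minibatch global estimates before storing them.
    return new_node_params, np.mean(global_estimates, axis=0)


def toy_fit(batch_node_params, global_params):
    """Placeholder update, NOT the paper's variational step."""
    return batch_node_params + global_params, batch_node_params.mean(axis=0)


node_params = np.arange(6, dtype=float).reshape(6, 1)
global_params = np.array([10.0])
node_params, global_params = run_pass(node_params, global_params, 2, toy_fit)
```

Running several passes amounts to calling `run_pass` repeatedly, feeding the stored node parameters and averaged global parameters from one pass into the next.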