Bayesian Modeling for Aggregated Relational Data: A Unified Perspective

Owen G. Ward, Anna L. Smith, Tian Zheng

Abstract

Aggregated relational data (ARD) is widely collected to study social networks in fields such as sociology, public health and economics. Many of the successes of ARD inference have been driven by increasingly complex Bayesian models, which provide principled and flexible ways of reflecting dependence patterns and biases encountered in real data. In this work we provide researchers with a unified collection of Bayesian implementations of existing models for ARD, within the state-of-the-art Bayesian sampling language Stan. Our implementations incorporate within-iteration rescaling procedures by default, improving algorithm run time and convergence diagnostics. Estimating ARD parameters requires carefully balancing model complexity against computational cost and data requirements, yet this trade-off has received relatively limited systematic attention in the literature. Moreover, general model comparison tools applicable across a wide range of ARD models remain underdeveloped, and existing approaches often require substantial expertise in Bayesian computation and software. Using synthetic data, we demonstrate how well competing models recover true personal network sizes and subpopulation sizes, and how existing posterior predictive checks compare across a range of Bayesian ARD models. We provide code to leverage Stan's modeling framework for exact $K$-fold cross-validation, and explain why approximate leave-one-out estimates often fail for many ARD models. This work highlights important connections and future directions in Bayesian modeling of ARD, providing practical guidance for selecting and evaluating Bayesian ARD models.

Paper Structure

This paper contains 35 sections, 15 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Links between Bayesian models for aggregated relational data. All models which assume the overdispersed negative binomial distribution also include an overdispersion parameter, $\omega_k$.
  • Figure 2: Distribution of the true degree of the nodes in the ARD sample for each of the synthetic datasets used.
  • Figure 3: Trace plots of the posterior draws of the log-degree parameter in the Erdős-Rényi model, with and without scaling. The trace plot without scaling shows poor convergence, indicating the model fails to fit well. This is reflected in the associated values of $\hat{R}$.
  • Figure 4: True degree distribution and posterior degree distribution under common ARD models for each of the synthetic datasets examined.
  • Figure 5: Recovery of subpopulation sizes for all subpopulations under the Zheng et al. (2006) model for the replicate McCarty et al. (2001) data. Black points indicate the true size of each subpopulation. Colored points (red for known subpopulations and blue for the unknown subpopulation) indicate posterior medians, with thin and thick colored bars representing 90% and 50% credible intervals respectively. Subpopulations are ordered by their posterior median size.
  • ...and 3 more figures