Covariate Selection for Joint Latent Space Modeling of Sparse Network Data

Emma G Crenshaw; Yuhua Zhang; Jukka-Pekka Onnela

TL;DR

This work tackles covariate selection in joint latent space models for sparse networks with high-dimensional binary node attributes. It introduces a two-stage approach that first fits a joint latent space model and then screens covariates with a group lasso, using a measurement-error-aware stabilization term to account for uncertainty in the estimated latent positions, followed by refitting on the selected covariates. Theoretical results establish prediction error rates of order $O(\log q / n)$ (up to latent position estimation error) in the high-dimensional setting, with separate results for perfectly observed and for estimated latent positions. Empirical studies show improved robustness to covariate sparsity and substantial data-collection savings in real-world network applications, illustrating the method's usefulness for efficient study design in network-based social and epidemiological research.

Abstract

Network data are increasingly common in the social sciences and infectious disease epidemiology. Analyses often link network structure to node-level covariates, but existing methods falter with sparse networks and high-dimensional node features. We propose a joint latent space modeling framework for sparse networks with high-dimensional binary node covariates that performs covariate selection while accounting for uncertainty in estimated latent positions. Building on joint latent space models that couple edges and node variables through shared latent positions, we introduce a group lasso screening step and incorporate a measurement-error-aware stabilization term to mitigate bias from using estimated latent positions as predictors. We establish prediction error rates for the covariate component both when latent positions are treated as observed and when they are estimated with bounded error; under uniform control across $q$ covariates and $n$ nodes, the rate is of order $O(\log q / n)$ up to an additional term due to latent position estimation error. Our method addresses three challenges: (1) incorporating information from isolated nodes, which are common in sparse networks but often ignored; (2) selecting relevant covariates from high-dimensional spaces; and (3) accounting for uncertainty in estimated latent positions. Simulations show predictive performance remains stable as covariate sparsity grows, while naive approaches degrade. We illustrate how the method can support efficient study design using household social networks from 75 Indian villages, where an emulated pilot study screens a large covariate battery and substantially reduces required subsequent data collection without sacrificing network predictive accuracy.
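The screening step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a logistic link with no intercept for each binary covariate, treats the $k$ coefficients of each covariate on the latent dimensions as one group, takes the latent positions `Z` as given, and omits both the measurement-error stabilization term and the edge model. The function name `group_lasso_screen`, the simulated data, and the constant `C = 2` are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def group_lasso_screen(Z, Y, lam, n_iter=500):
    """Proximal gradient descent for logistic loss plus a group lasso penalty.

    Z: (n, k) latent positions, Y: (n, q) binary covariates.
    Column j of the returned (k, q) matrix is the coefficient group of
    covariate j; a column that is exactly zero means the covariate is dropped.
    """
    n, k = Z.shape
    q = Y.shape[1]
    Theta = np.zeros((k, q))
    # Step size 1/L, where L bounds the curvature of the averaged logistic loss.
    step = 1.0 / (0.25 * np.linalg.eigvalsh(Z.T @ Z / n).max())
    for _ in range(n_iter):
        grad = Z.T @ (sigmoid(Z @ Theta) - Y) / n          # (k, q) gradient
        V = Theta - step * grad                            # gradient step
        norms = np.linalg.norm(V, axis=0)                  # one norm per group
        # Group soft-thresholding: shrink each column toward zero as a block.
        shrink = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        Theta = V * shrink
    return Theta

# Simulated example (assumed values): only the first 3 of 10 covariates
# depend on the latent positions.
rng = np.random.default_rng(0)
n, k, q = 400, 2, 10
Z = rng.normal(size=(n, k))
Theta_true = np.zeros((k, q))
Theta_true[:, :3] = np.array([[2.0, 2.0, -2.0],
                              [-2.0, 2.0, 2.0]])
Y = rng.binomial(1, sigmoid(Z @ Theta_true)).astype(float)

lam = 2.0 * np.sqrt(np.log(q) / n)   # C * sqrt(log q / n) with C = 2 (assumed)
Theta_hat = group_lasso_screen(Z, Y, lam)
selected = np.flatnonzero(np.linalg.norm(Theta_hat, axis=0) > 0)
print("selected covariates:", selected)
```

In a full pipeline the matrix `Z` would be replaced by the estimated positions from the joint fit, and the selected columns would be passed to the refitting step.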

Paper Structure

This paper contains 15 sections, 4 theorems, 61 equations, 2 figures, 3 tables, 2 algorithms.

Introduction
Model
Model Set Up and Notation
Estimation
Theoretical Results
Perfectly observed $Z$
Accounting for $Z$ with Measurement Error
Simulations
Simulation Design
Simulation Results
Application to Real Data
Discussion
Reminder of Model Notation and Assumptions
Proof of Theorem 1
Proof of Theorem 2

Key Result

theorem 1

Under Assumptions 1–5: (a) For any fixed $j \in \{1,\dots,q\}$, with $\lambda_j = C\sqrt{k/n}$ for some constant $C$: (b) For uniform control over all $j \in \{1,\dots,q\}$, with $\lambda = C\sqrt{\log q/n}$ for some constant $C$:
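To give a sense of scale for the two tuning parameters in the theorem, the short sketch below evaluates $C\sqrt{k/n}$ and $C\sqrt{\log q/n}$. The values $n = 200$, $k = 2$, $q = 32$, and $C = 1$ are illustrative assumptions, and the function names are hypothetical; the source does not specify $C$ or $k$.

```python
import math

def pointwise_lambda(k, n, C=1.0):
    # Penalty level for a single fixed covariate: C * sqrt(k / n).
    return C * math.sqrt(k / n)

def uniform_lambda(q, n, C=1.0):
    # Penalty level for uniform control over q covariates: C * sqrt(log q / n).
    return C * math.sqrt(math.log(q) / n)

n, k, q = 200, 2, 32          # assumed values, for illustration only
lam_point = pointwise_lambda(k, n)
lam_unif = uniform_lambda(q, n)
print(f"pointwise: {lam_point:.4f}, uniform: {lam_unif:.4f}")
```

With the same constant $C$, the uniform penalty exceeds the pointwise one whenever $\log q > k$, which reflects the price of controlling all $q$ covariates simultaneously.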

Figures (2)

Figure 1: Model Performance in Simulations. Boxplots represent the results from 30 independently simulated networks of $n = 200$ nodes. Sparse networks have an average density of 0.013 (average degree = 2.6); less sparse networks have an average density of 0.37 (average degree = 74). Panels A and B show the AUC for network data in the sparse and less sparse networks, respectively. Panels C and D show the AUC for node covariate data for the sparse and less sparse networks, respectively.
Figure 2: Model performance predicting node covariates with and without an emulated pilot study. The figure shows the average AUC for node covariates on the 65 villages not selected for the pilot study. Black markers show the average AUC over all 32 variables when estimated on the full covariate set. Blue 'X' markers show the average AUC when modeling only the covariates selected in the emulated pilot study. Red markers, described as "full covariate set (limited)", show the average AUC on the variables retained by the pilot study but estimated using all covariates. Villages are ordered by increasing AUC for the limited covariate set.
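The densities and average degrees quoted in the Figure 1 caption can be checked against each other. The sketch below assumes an undirected simple graph, for which average degree equals density times $(n - 1)$; the helper name `average_degree` is hypothetical.

```python
def average_degree(density, n):
    # Undirected simple graph (assumed): mean degree = density * (n - 1).
    return density * (n - 1)

n = 200
sparse_degree = average_degree(0.013, n)
dense_degree = average_degree(0.37, n)
print(round(sparse_degree, 1), round(dense_degree))
```

Both values round to the figures reported in the caption (2.6 and 74).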

Theorems & Definitions (6)

theorem 1: Prediction Error Rate, Perfectly Observed $Z$
theorem 2: Prediction Error Rate, Estimated $\hat{Z}$
theorem 1: Prediction Error Rate, Perfectly Observed $Z$
proof
theorem 2
proof
