Post-selection inference with a single realization of a network

Ethan Ancell, Daniela Witten, Daniel Kessler

TL;DR

The paper tackles post-selection inference for parameters defined from a single network realization by splitting the data into train and test networks via data thinning ($A^{(tr)}$, $A^{(te)}$) for Gaussian/Poisson edges or data fission for Bernoulli edges. It defines a data-driven target parameter based on estimated communities and provides selective confidence intervals that account for the data-dependent parameter selection, with finite-sample results for Gaussian edges and asymptotic results for Poisson and Bernoulli edges. The authors validate the approach with simulations under SBM-like setups and apply it to dolphin relationship data, showing meaningful within-vs-between connectivity differences under appropriate splits. The framework is model-agnostic beyond edge independence and distributional family, and comes with software to implement the procedure.

Abstract

Given a dataset consisting of a single realization of a network, we consider conducting inference on a parameter selected from the data. In particular, we focus on the setting where the parameter of interest is a linear combination of the mean connectivities within and between estimated communities. Inference in this setting poses a challenge, since the communities are themselves estimated from the data. Furthermore, since only a single realization of the network is available, sample splitting is not possible. In this paper, we show that it is possible to split a single realization of a network consisting of $n$ nodes into two (or more) networks involving the same $n$ nodes; the first network can be used to select a data-driven parameter, and the second to conduct inference on that parameter. In the case of weighted networks with Poisson or Gaussian edges, we obtain two independent realizations of the network; by contrast, in the case of Bernoulli edges, the two realizations are dependent, and so extra care is required. We establish the theoretical properties of our estimators, in the sense of confidence intervals that attain the nominal (selective) coverage, and demonstrate their utility in numerical simulations and in application to a dataset representing the relationships among dolphins in Doubtful Sound, New Zealand.
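The abstract's claim that a network with Poisson edges can be split into two independent networks on the same nodes can be illustrated with the standard Poisson thinning construction: each observed count is divided between a train and a test copy by a binomial draw. The sketch below is only an illustration under that assumption; the function name `thin_poisson_network`, the argument `eps`, and the list-of-lists representation are hypothetical and are not taken from the authors' software.

```python
import random


def thin_poisson_network(A, eps, seed=None):
    """Split a matrix of non-negative integer counts into train and test parts.

    Assumption (standard Poisson thinning, not quoted from the paper):
    A_tr[i][j] | A[i][j] ~ Binomial(A[i][j], eps) and A_te = A - A_tr.
    If A[i][j] ~ Poisson(M_ij), the two parts are then independent with
    means eps * M_ij and (1 - eps) * M_ij.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = random.Random(seed)
    # A Binomial(a, eps) draw is a sum of a independent Bernoulli(eps) draws.
    A_tr = [[sum(rng.random() < eps for _ in range(a)) for a in row] for row in A]
    A_te = [[a - t for a, t in zip(row, row_tr)] for row, row_tr in zip(A, A_tr)]
    return A_tr, A_te


# Toy 3-node network with integer edge weights (illustrative values only).
A = [[0, 4, 7],
     [4, 0, 2],
     [7, 2, 0]]
A_tr, A_te = thin_poisson_network(A, eps=0.5, seed=1)
```

Each entry is thinned independently here, so a symmetric input does not generally yield symmetric outputs; for an undirected network one would thin the upper triangle and mirror it.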

Paper Structure

This paper contains 40 sections, 18 theorems, 118 equations, 11 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

Suppose that $\epsilon \in (0,1)$, and $A_{ij} \overset{\text{ind.}}{\sim} \mathcal{N}(M_{ij}, \tau^2)$ for $i=1,\ldots,n$ and $j=1,\ldots,n$. For $A^{(\textnormal{tr})}_{ij} \mid A_{ij} \overset{\text{ind.}}{\sim} \mathcal{N}(\epsilon A_{ij}, \epsilon (1- \epsilon) \tau^2)$ and $A^{(\textnormal{te}
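The train-network draw stated in Proposition 1 can be sketched in a few lines. The proposition text above is cut off before the test network is defined, so the sketch assumes the usual Gaussian data-thinning choice $A^{(\textnormal{te})} = A - A^{(\textnormal{tr})}$; that choice, the function name `thin_gaussian_network`, and the list-of-lists representation are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def thin_gaussian_network(A, eps, tau2, seed=None):
    """Split a matrix with N(M_ij, tau2) entries into train and test parts.

    Train draw (as in Proposition 1):
        A_tr[i][j] | A[i][j] ~ N(eps * A[i][j], eps * (1 - eps) * tau2)
    Test part (assumed here, standard Gaussian thinning):
        A_te = A - A_tr
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = random.Random(seed)
    sd = math.sqrt(eps * (1.0 - eps) * tau2)  # conditional standard deviation
    A_tr = [[rng.gauss(eps * a, sd) for a in row] for row in A]
    A_te = [[a - t for a, t in zip(row, row_tr)] for row, row_tr in zip(A, A_tr)]
    return A_tr, A_te


# Toy 2 x 2 matrix of edge weights (illustrative values only).
A = [[30.0, 27.5],
     [26.1, 31.2]]
A_tr, A_te = thin_gaussian_network(A, eps=0.5, tau2=25.0, seed=0)
```

Under this construction a larger `eps` leaves more information in the train network (for selecting communities) and less in the test network (for inference), which matches the trade-off plotted against $\epsilon$ in the figures.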

Figures (11)

  • Figure 1: We consider the setting where an analyst (a) uses a single realization of a network $A$ to select a parameter, and (b) proceeds to conduct inference on that parameter. In step (b), it is crucial to account for the fact that the parameter was selected using the data.
  • Figure 2: (a): Chen and Lei (2018) propose partitioning the nodes into two disjoint sets, depicted with solid and dashed circles. Edges incident to solid nodes are used for training, and testing is performed using the remaining edges. (b): Li et al. (2020) propose partitioning the edges into two disjoint sets: training uses the first set with the aid of matrix completion, and testing uses the second set. (c): For networks with Bernoulli edges, our proposal produces a train network by "toggling" each edge (or non-edge) with probability $\gamma \in (0, 0.5)$ (see Proposition \ref{prop:univariate_bernoulli_fission}). The conditional distribution of the original network given the train network is used for inference.
  • Figure 3: Simulations comparing $|V_{11}(A^{(\textnormal{tr})}) - B_{11}(A^{(\textnormal{tr})})|$ (blue curves) and $|\Phi_{11}(A^{(\textnormal{tr})}) - B_{11}(A^{(\textnormal{tr})})|$ (red curves), where $B_{k \ell}(A^{(\textnormal{tr})})$, $V_{k \ell}(A^{(\textnormal{tr})})$, and $\Phi_{k \ell}(A^{(\textnormal{tr})})$ are defined in \ref{eq:B_entry_definition}, \ref{eq:Vkl}, and \ref{eq:Phi_kl_def}, respectively, plotted over a range of $\gamma$. The networks have $n=100$ nodes, and results are averaged across 5,000 repetitions. Setting 1: $M_{ij}=0.5$ for all $i$ and $j$. Setting 2: The entries of $M$ belong to two equally-sized communities, where the intra-community entries of $M$ equal $0.6$ and the inter-community entries equal $0.4$. Setting 3: Each entry of $M$ is drawn from a $\mathrm{Uniform}(0,1)$ distribution.
  • Figure 4: Results for Gaussian edges, averaged over 5,000 simulated networks. Left: Empirical versus nominal coverage of the confidence intervals for $B_{11}(A^{(\textnormal{tr})})$ (proposed approach as described in Proposition \ref{prop:poisson_estimation}) or $B_{11}(A)$ (naive approach as described in Supplement \ref{appendix:naive-cis}), with $n=200$, $K^{\text{true}}=5$, $\rho_1=30$, $\rho_2=27$, $\tau^2 = 25$, and $\epsilon=0.5$ for the proposed approach. Center and Right: Average adjusted Rand index between true and estimated communities, and average 90% confidence interval width, as a function of $\epsilon$, for the proposed approach on networks with $n=200$, $K = 5$, $K^{\text{true}} = 5$, $\rho_1 = 30$, and $\tau^2 = 25$.
  • Figure 5: Results for Poisson edges. All other details are the same as Figure \ref{fig:conf_width_rand_gaussian}.
  • ...and 6 more figures
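The "toggling" step described in the Figure 2(c) caption can be sketched as follows: each entry of a binary adjacency matrix is flipped independently with probability $\gamma$ to form the train network, and inference then relies on the conditional distribution of the original entry given the train entry. The conditional probability below follows from Bayes' rule for a single Bernoulli($m$) edge; the function names and the helper `prob_edge_given_train` are hypothetical, and the paper's own estimators may be organized differently.

```python
import random


def fission_bernoulli_network(A, gamma, seed=None):
    """Build a train network by flipping each 0/1 entry of A with probability gamma."""
    if not 0.0 < gamma < 0.5:
        raise ValueError("gamma must lie in (0, 0.5)")
    rng = random.Random(seed)
    # XOR with a Bernoulli(gamma) indicator toggles the entry when the draw is 1.
    return [[a ^ int(rng.random() < gamma) for a in row] for row in A]


def prob_edge_given_train(a_tr, m, gamma):
    """P(A_ij = 1 | A_tr_ij = a_tr) when A_ij ~ Bernoulli(m), by Bayes' rule."""
    like_edge = (1.0 - gamma) if a_tr == 1 else gamma      # P(A_tr = a_tr | A = 1)
    like_no_edge = gamma if a_tr == 1 else (1.0 - gamma)   # P(A_tr = a_tr | A = 0)
    return m * like_edge / (m * like_edge + (1.0 - m) * like_no_edge)


# Toy 3-node binary network (illustrative values only).
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
A_tr = fission_bernoulli_network(A, gamma=0.2, seed=3)
```

Unlike the thinning constructions for Gaussian and Poisson edges, the original and train networks remain dependent here, and the conditional probability involves the unknown mean `m`; this is the sense in which the Bernoulli case needs extra care.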

Theorems & Definitions (28)

  • Proposition 1: Thinning for Gaussian edges
  • Proposition 2: Thinning for Poisson edges
  • Proposition 3: Fission for Bernoulli edges
  • Remark 1
  • Proposition 4
  • Proposition 5
  • Remark 2
  • Proposition 6
  • Remark 3
  • Remark 4
  • ...and 18 more