Post-selection inference with a single realization of a network
Ethan Ancell, Daniela Witten, Daniel Kessler
TL;DR
The paper addresses post-selection inference on a parameter chosen from a single realization of a network. It splits the observed network into a train network $A^{(tr)}$ and a test network $A^{(te)}$ on the same nodes, using data thinning for Gaussian or Poisson edges and data fission for Bernoulli edges. Communities estimated from the train network define a data-driven target parameter, and selective confidence intervals built from the test network account for that data-dependent selection, with finite-sample results for Gaussian edges and asymptotic results for Poisson and Bernoulli edges. The authors validate the approach in simulations under stochastic block model (SBM)-like setups and apply it to dolphin relationship data, where it shows meaningful differences between within-community and between-community connectivity under appropriate splits. Beyond independent edges from the assumed distributional family, the framework is model-agnostic, and it comes with software that implements the procedure.
Abstract
Given a dataset consisting of a single realization of a network, we consider conducting inference on a parameter selected from the data. In particular, we focus on the setting where the parameter of interest is a linear combination of the mean connectivities within and between estimated communities. Inference in this setting poses a challenge, since the communities are themselves estimated from the data. Furthermore, since only a single realization of the network is available, sample splitting is not possible. In this paper, we show that it is possible to split a single realization of a network consisting of $n$ nodes into two (or more) networks involving the same $n$ nodes; the first network can be used to select a data-driven parameter, and the second to conduct inference on that parameter. In the case of weighted networks with Poisson or Gaussian edges, we obtain two independent realizations of the network; by contrast, in the case of Bernoulli edges, the two realizations are dependent, and so extra care is required. We establish the theoretical properties of our estimators, in the sense of confidence intervals that attain the nominal (selective) coverage, and demonstrate their utility in numerical simulations and in application to a dataset representing the relationships among dolphins in Doubtful Sound, New Zealand.
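To make the splitting idea concrete for Poisson edges, the following is a minimal sketch and not the authors' software. It relies on the standard Poisson thinning fact: if $A_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$ and $A^{(tr)}_{ij} \mid A_{ij} \sim \mathrm{Binomial}(A_{ij}, \epsilon)$, then $A^{(tr)}_{ij}$ and $A^{(te)}_{ij} = A_{ij} - A^{(tr)}_{ij}$ are independent Poisson variables with means $\epsilon\lambda_{ij}$ and $(1-\epsilon)\lambda_{ij}$. The function names `thin_poisson` and `block_means`, the treatment of every matrix entry (including the diagonal) as an independent edge, the two-community eigenvector split used as a stand-in community estimator, and the rescaling by $1/(1-\epsilon)$ are all assumptions made for illustration. The sketch stops at point estimates and does not construct the selective confidence intervals.

```python
import numpy as np


def thin_poisson(A, eps=0.5, seed=None):
    """Split Poisson counts A into (A_tr, A_te) with A_tr + A_te == A.

    Each unit of count is assigned to the train network with probability eps.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=np.int64)
    A_tr = rng.binomial(A, eps)
    A_te = A - A_tr
    return A_tr, A_te


def block_means(A, labels):
    """Mean edge value for every ordered pair of communities present in labels."""
    A = np.asarray(A)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    means = np.empty((groups.size, groups.size))
    for i, k in enumerate(groups):
        for j, l in enumerate(groups):
            means[i, j] = A[np.ix_(labels == k, labels == l)].mean()
    return means


# Simulated network: two planted communities, denser within than between.
rng = np.random.default_rng(0)
n = 40
true_labels = np.repeat([0, 1], n // 2)
lam = np.where(true_labels[:, None] == true_labels[None, :], 8.0, 2.0)
A = rng.poisson(lam)

eps = 0.5
A_tr, A_te = thin_poisson(A, eps=eps, seed=1)

# Stand-in community estimator (an assumption, not the paper's method):
# split nodes by the sign of the second leading eigenvector of the
# symmetrized train network.
S = (A_tr + A_tr.T) / 2.0
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
est_labels = (eigvecs[:, -2] > 0).astype(int)

# Mean connectivities on the test network for the estimated communities,
# rescaled by 1 / (1 - eps) to return to the scale of the original network.
B_hat = block_means(A_te, est_labels) / (1.0 - eps)
print(B_hat)
```

Because the communities are chosen using only $A^{(tr)}$ and the means are computed using only $A^{(te)}$, the independence of the two networks is what lets the second step ignore how the communities were selected. A linear combination of the entries of `B_hat`, such as a within-community mean minus a between-community mean, would then be an estimate of a parameter of the kind described above.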
