Low-order outcomes and clustered designs: combining design and analysis for causal inference under network interference

Matthew Eichhorn; Samir Khan; Johan Ugander; Christina Lee Yu

TL;DR

The paper addresses causal inference under network interference by combining low-order ($\beta$-order) outcome models with graph cluster randomized designs. It introduces a generalized pseudoinverse estimator for the total treatment effect that applies under general experimental designs, and provides bias and variance bounds stated in terms of properties of the design, including specialized results for Bernoulli graph cluster randomized designs. The results show that combining the low-order estimator with a clustered design yields a variance that scales like the smaller of the variances achieved by either approach alone, and the bounds guide practical clustering choices. Empirical evidence demonstrates the bounds’ usefulness for selecting clusterings across diverse graphs and response models, with Monte Carlo methods enabling application to complex designs. Overall, the framework offers a scalable, design-aware pathway to robust causal inference in interference settings and suggests directions for further theoretical and methodological development.

Abstract

Variance reduction for causal inference in the presence of network interference is often achieved through either outcome modeling, typically analyzed under unit-randomized Bernoulli designs, or clustered experimental designs, typically analyzed without strong parametric assumptions. In this work, we study the intersection of these two approaches and make the following threefold contributions. First, we present an estimator of the total treatment effect (or global average treatment effect) in low-order outcome models when the data are collected under general experimental designs, generalizing previous results for Bernoulli designs. We refer to this estimator as the pseudoinverse estimator and give bounds on its bias and variance in terms of properties of the experimental design. Second, we evaluate these bounds for the case of Bernoulli graph cluster randomized (GCR) designs. Under these designs, the variance of the pseudoinverse estimator scales like the smaller of the variance obtained by the estimator derived under a low-order assumption and the variance obtained from cluster randomization, showing that combining these variance reduction strategies is preferable to using either individually. When the order of the potential outcomes model is correctly specified, our estimator is always unbiased, and under a misspecified model, we upper bound the bias by the closeness of the ground truth model to a low-order model. Third, we give empirical evidence that our variance bounds can be used to select a good clustering that minimizes the worst-case variance under a cluster randomized design from a set of candidate clusterings. Across a range of graphs and clustering algorithms, our method consistently selects clusterings that perform well on a range of response models, suggesting the practical use of our bounds.
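As a minimal sketch of the Bernoulli GCR design described in the abstract, the snippet below treats each cluster independently with probability p and lets every unit inherit its cluster's assignment. The function name `bernoulli_gcr`, the label-array input format, and the toy values (4 clusters of 3 units, p = 0.5) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def bernoulli_gcr(clusters, p, rng):
    """Draw one treatment vector under a Bernoulli cluster-randomized design.

    clusters: length-n integer array of cluster labels in 0..m-1
              (hypothetical input format chosen for this sketch).
    p:        probability that a cluster is assigned to treatment.
    """
    clusters = np.asarray(clusters)
    m = clusters.max() + 1
    cluster_z = rng.random(m) < p            # one independent Bernoulli(p) coin per cluster
    return cluster_z[clusters].astype(int)   # each unit inherits its cluster's coin

rng = np.random.default_rng(0)
clusters = np.repeat(np.arange(4), 3)        # toy example: 12 units, 4 clusters of size 3
z = bernoulli_gcr(clusters, p=0.5, rng=rng)
```

Passing `clusters = np.arange(n)` (every unit in its own cluster) recovers the unit-randomized Bernoulli design as a special case.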

Paper Structure (39 sections, 17 theorems, 130 equations, 6 figures, 1 table)

This paper contains 39 sections, 17 theorems, 130 equations, 6 figures, 1 table.

Introduction
Summary Of Results
Pseudoinverse estimators.
Bernoulli GCR designs.
Empirical validation for optimizing design of clustered experiments.
Related Work
Neighborhood exposure mappings.
Graph structure assumptions.
Outcome model assumptions.
Design of clustered experiments.
Set-up and Notation
Low-order Outcome Models
Total Treatment Effect
Experimental Designs
The Pseudoinverse Estimator
...and 24 more sections

Key Result

Theorem 1

where $\mathbf{c}_i^{>\beta}$ contains coefficients that correspond to causal effects of neighborhood subsets of size larger than $\beta$. The variance is upper-bounded by where $B$ is an absolute bound on the potential outcomes, $C$ is the maximum number of clusters an individual is influenced by, $N$ is the maximum cluster size, $d_{\max}$ is the maximum neighborhood size of an individual, $n$

Figures (6)

Figure 1: The graph $G$ (left) that we consider in our experiments showing that $\widehat{\mathop{\mathrm{TTE}}\nolimits}_{\beta}$ has lower variance than the Horvitz--Thompson estimator. $G$ is the third power of a cycle on $n=840$ vertices, so each vertex has degree 7 (3 neighbors on each side and a self-loop). Two clusterings of $G$ that we consider are shown on the right. The value of $w$ controls the average cluster-degree $|\mathcal{C}(\mathcal{N}_i)|$. When $w=1$, we have that $|\mathcal{C}(\mathcal{N}_i)|=d_i=7$ for all $i$, while for $w=7$, the nodes in the center of each cluster have $|\mathcal{C}(\mathcal{N}_i)|=1$.
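The cluster-degree claims in the Figure 1 caption can be checked with a short sketch. It builds the neighborhoods of the third power of a cycle on n = 840 vertices (self-loop included) and counts, for each vertex, how many clusters its neighborhood touches. The assumption that a clustering of width w groups w consecutive vertices into contiguous blocks is ours, inferred from the caption; the function names are hypothetical.

```python
import numpy as np

def cycle_power_neighborhoods(n, k):
    """Neighborhood N_i in the k-th power of an n-cycle, including i itself."""
    offsets = np.arange(-k, k + 1)
    return (np.arange(n)[:, None] + offsets[None, :]) % n   # shape (n, 2k+1)

def cluster_degrees(neighborhoods, clusters):
    """|C(N_i)|: number of distinct clusters intersecting each neighborhood."""
    return np.array([len(set(clusters[nb].tolist())) for nb in neighborhoods])

n = 840
nbrs = cycle_power_neighborhoods(n, k=3)     # each vertex has 7 neighbors (with self-loop)
summary = {}
for w in (1, 7):
    clusters = np.arange(n) // w             # assumed: contiguous blocks of w vertices
    cdeg = cluster_degrees(nbrs, clusters)
    summary[w] = (int(cdeg.min()), int(cdeg.max()))
    print(w, summary[w])
# prints: 1 (7, 7)
# prints: 7 (1, 2)
```

With w = 1 every vertex has cluster-degree 7, matching its degree. With w = 7 the center vertex of each block has cluster-degree 1, and every other vertex touches exactly two blocks.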
Figure 2: MSE of the Horvitz--Thompson and well-specified pseudoinverse estimators for different values of $\beta^*$ and $w$ averaged over 1000 trials. Note that the number of clusters is $m=n/w$. We see across both designs that (a) the pseudoinverse estimator consistently improves on the Horvitz--Thompson estimator; (b) this improvement is largest for small values of $w$ (which are poor clusterings of the graph); (c) the variance of the pseudoinverse is less sensitive to the quality of the clustering for small values of $\beta$.
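For orientation, here is a sketch of one common form of the Horvitz-Thompson baseline under full-neighborhood exposure and a Bernoulli GCR design, where the probability that all of $\mathcal{N}_i$ is treated is $p^{|\mathcal{C}(\mathcal{N}_i)|}$. Whether this matches the exact baseline used in the paper's experiments is an assumption. The linear response model, p = 0.5, the contiguous clustering with w = 7, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def ht_estimate(y, z, neighborhoods, cdeg, p):
    """Horvitz-Thompson estimate of the TTE under full-neighborhood exposure.

    cdeg[i] = |C(N_i)|, so P(all of N_i treated) = p**cdeg[i] under Bernoulli GCR.
    """
    zn = z[neighborhoods]                          # (n, d) treatments of each neighborhood
    all_treated = zn.all(axis=1)
    all_control = (1 - zn).all(axis=1)
    weights = all_treated / p**cdeg - all_control / (1 - p)**cdeg
    return float(np.mean(y * weights))

n, k, w, p = 840, 3, 7, 0.5
nbrs = (np.arange(n)[:, None] + np.arange(-k, k + 1)[None, :]) % n
clusters = np.arange(n) // w                       # assumed contiguous clustering
cdeg = np.array([len(set(clusters[nb].tolist())) for nb in nbrs])

def outcomes(z):
    # Hypothetical linear response: Y_i = 1 + (fraction of N_i treated), so TTE = 1.
    return 1.0 + z[nbrs].mean(axis=1)

rng = np.random.default_rng(0)
estimates = []
for _ in range(2000):
    z = (rng.random(n // w) < p)[clusters].astype(int)   # one Bernoulli GCR draw
    estimates.append(ht_estimate(outcomes(z), z, nbrs, cdeg, p))
print(np.mean(estimates), np.std(estimates))       # mean should be close to the true TTE of 1
```

The inverse-probability weights grow like $p^{-|\mathcal{C}(\mathcal{N}_i)|}$, which is why the quality of the clustering matters so much for this baseline.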
Figure 3: Bias, variance, and MSE of the pseudoinverse estimator with $\beta=1$ and $\beta=4$ for different values of $w$ and $\beta^*\in\{1, 4\}$ averaged over 1000 trials. Note that the number of clusters is $m=n/w$. We see that the $\beta=1$ estimator has lower MSE in both settings: this is because the bias it incurs when $\beta^*=4$ is negligible compared to the additional variance incurred by the $\beta=4$ estimator.
Figure 4: Visualization of our theoretical RMSE bounds (dashed) and actual RMSE in simulation (solid) across a range of Louvain algorithm resolution parameters for the $\text{SBM}(0.5, 0)$ and $\text{SBM}(0.5, 0.2)$ graphs under a $\mathop{\mathrm{GCR}}\nolimits$ design and the $\mathop{\mathrm{TTE}}\nolimits_i=1$ and $\mathop{\mathrm{TTE}}\nolimits_i=0$ response models. We see that the qualitative shape of our bounds correctly reflects the dependence of the RMSE on the resolution parameter.
Figure 5: Monte Carlo error when estimating design matrices for $\mathop{\mathrm{GCR}}\nolimits$ designs. We see that our estimates converge to the desired quantity in large sample sizes.
...and 1 more figure

Theorems & Definitions (32)

Theorem 1: Re-statement of Corollaries \ref{cor:gcr_unbiased}, \ref{cor:gcr_bias_bound} and \ref{thm:pi_gcr}
Remark 1
Example 1
Lemma 1
Lemma 2
Theorem 2
Corollary 1
Corollary 2
Theorem 3
Remark 2
...and 22 more
