RCT Rejection Sampling for Causal Estimation Evaluation

Katherine A. Keith, Sergey Feldman, David Jurgens, Jonathan Bragg, Rohit Bhattacharya

TL;DR

The paper addresses the lack of robust empirical benchmarks for high-dimensional causal estimation. It builds on RCT subsampling, an existing evaluation strategy, and introduces RCT rejection sampling, a sampling method with theoretical guarantees that causal effects remain identifiable in the resulting observational data. It provides formal identification arguments, contrasts them with a non-identification result for a previously proposed algorithm, and demonstrates substantial bias reduction and appropriate interval coverage in synthetic experiments. A proof-of-concept pipeline using a large real-world RCT with text covariates walks through practical considerations, modeling choices, and diagnostics, and the authors release data and code to support reproducibility. Overall, the work advances empirical evaluation of causal estimators and lays groundwork for broader benchmarks that assess methods under realistic, high-dimensional confounding.

Abstract

Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates -- such as text data, genomics, or the behavioral social sciences -- researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT -- which we release publicly -- consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
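
The subsampling idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the algorithm, so the code below is standard rejection sampling adapted to this setting, not necessarily the authors' exact procedure. The binary covariate, the target function `p_star`, the synthetic outcome model, and all function names are assumptions made for illustration. Each RCT unit is kept with probability proportional to $P^*(T|C)/P(T)$, which induces a dependence between $C$ and $T$ in the kept sample.

```python
import random


def rct_rejection_sample(rows, p_star, rng):
    """Subsample RCT rows (t, c, y) so that T depends on C in the kept sample.

    p_star(t, c) is a designer-chosen target P*(T=t | C=c) (hypothetical name).
    """
    p_t1 = sum(t for t, _, _ in rows) / len(rows)  # empirical P(T=1) in the RCT
    p_t = {1: p_t1, 0: 1.0 - p_t1}
    ratios = [p_star(t, c) / p_t[t] for t, c, _ in rows]
    M = max(ratios)  # bound that keeps every acceptance probability <= 1
    return [row for row, r in zip(rows, ratios) if rng.random() < r / M]


def p_star(t, c):
    # Assumed confounding strength: P*(T=1 | C=1) = 0.8, P*(T=1 | C=0) = 0.2.
    p1 = 0.8 if c == 1 else 0.2
    return p1 if t == 1 else 1.0 - p1


def mean(xs):
    return sum(xs) / len(xs)


# Synthetic RCT (assumed data-generating process): T is randomized, true ATE is 1.0.
rng = random.Random(0)
rct = []
for _ in range(20_000):
    c, t = rng.randint(0, 1), rng.randint(0, 1)
    rct.append((t, c, 1.0 * t + 2.0 * c + rng.gauss(0.0, 1.0)))

sample = rct_rejection_sample(rct, p_star, rng)

# Unadjusted difference in means on the confounded sample (expected to be biased).
naive = mean([y for t, _, y in sample if t == 1]) - mean([y for t, _, y in sample if t == 0])

# Backdoor adjustment over C, playing the role of an oracle estimator.
adjusted = 0.0
for c0 in (0, 1):
    stratum = [(t, y) for t, c, y in sample if c == c0]
    effect = mean([y for t, y in stratum if t == 1]) - mean([y for t, y in stratum if t == 0])
    adjusted += len(stratum) / len(sample) * effect

print(f"kept {len(sample)} of {len(rct)} units")
print(f"naive estimate: {naive:.2f}, adjusted estimate: {adjusted:.2f}")
```

In this toy setup the unadjusted estimate absorbs the effect of $C$ on $Y$, while adjusting for $C$ should recover a value close to the RCT's average effect. That gap is the kind of comparison such an evaluation is meant to enable.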

Paper Structure (31 sections, 2 theorems, 10 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 31 sections, 2 theorems, 10 equations, 6 figures, 6 tables, 1 algorithm.

Key Result

Proposition 3.1

Given $n$ i.i.d. samples from a distribution $P$ that is Markov relative to Figure 1(a), Algorithm 2 in Gentzel et al. (2021) draws samples according to a distribution $P^*$ such that condition (II) is not satisfied.

Figures (6)

  • Figure 1: Causal DAGs (a) corresponding to an RCT; (b) representing a sampling procedure; (c) corresponding to an observational study where $C$ satisfies the backdoor criterion.
  • Figure 2: Proof of concept approach. Left figure. Causal DAG for the proxy strategy. The blue edges are confirmed to empirically exist in the finite dataset. The red edge is selected by the evaluation designers via $P^*(T|C)$. Right table. RCT dataset descriptive statistics including the number of units in the population/subpopulation ($n$) and the odds ratio, $OR(C, Y)$.
  • Figure 3: Diagnostic plots for proof of concept pipeline. For Subpopulation A data, each plot is the parameterization of $P^{*}(T|C)$ in Equation \ref{eqn:our-p-star}, which is specified by the evaluation designer. Each blue circle is a different random seed (100 seeds total per plot/parameterization).
  • Figure 4: Diagnostic plot for Subpopulation B. Each plot is the parameterization of the researcher-specified confounding functions, $P^{*}(T|C)$ in Equation \ref{eqn:our-p-star}. Each blue dot is a different random seed (100 seeds total per plot/parameterization).
  • Figure 5: For Synthetic DGP #1 (single random seed) and 1000 bootstrap samples, we plot the 95% confidence intervals for the original RCT (difference in means estimator), RCT rejection sampling with a parametric adjustment (with knowledge of the oracle adjustment), and Algorithm 2 from Gentzel et al. (2021) with a parametric adjustment (with knowledge of the oracle adjustment). The mean of the bootstrap samples is denoted by the dot.
  • ...and 1 more figure
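
The Figure 5 caption mentions 95% confidence intervals from 1000 bootstrap samples of a difference in means estimator. A minimal sketch of one common way to compute such an interval follows. It uses the percentile bootstrap, which is an assumption, since the caption does not say which bootstrap interval the authors used. The synthetic data and the names `diff_in_means` and `bootstrap_ci` are hypothetical.

```python
import random


def diff_in_means(rows):
    """Difference in mean outcome between treated and control rows (t, y)."""
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def bootstrap_ci(rows, n_boot=1000, alpha=0.05, seed=0):
    """Approximate percentile bootstrap interval; returns (bootstrap mean, lower, upper)."""
    rng = random.Random(seed)
    n = len(rows)
    # Resample units with replacement and re-estimate on each resample.
    estimates = sorted(diff_in_means(rng.choices(rows, k=n)) for _ in range(n_boot))
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return sum(estimates) / n_boot, lower, upper


# Synthetic RCT (assumed): randomized binary treatment, true effect 1.0, unit-variance noise.
data_rng = random.Random(1)
rows = []
for _ in range(1000):
    t = data_rng.randint(0, 1)
    rows.append((t, 1.0 * t + data_rng.gauss(0.0, 1.0)))

center, lower, upper = bootstrap_ci(rows)
print(f"bootstrap mean {center:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

The bootstrap mean corresponds to the dot described in the caption, and the interval endpoints are the empirical 2.5% and 97.5% quantiles of the resampled estimates.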

Theorems & Definitions (4)

  • Proposition 3.1
  • Theorem 3.2
  • proof
  • proof