A Double Machine Learning Approach to Combining Experimental and Observational Data

Harsh Parikh; Marco Morucci; Vittorio Orlandi; Sudeepa Roy; Cynthia Rudin; Alexander Volfovsky

TL;DR

The paper tackles the challenge of estimating population treatment effects using both experimental and observational data when external validity or ignorability may be violated. It introduces a double machine learning framework with cross-fitting, efficient influence functions, and a falsification test based on the statistic $\theta(t)$ to detect violations of A3 (external validity) or A4 (conditional ignorability). A key theoretical contribution is the impossibility of a doubly resilient estimator when the violated assumption is unknown, motivating the proposed two-stage estimators for $\theta(t)$ and $\nu(t)$, which are root-$n$ consistent and provide valid confidence intervals. The authors validate their approach with synthetic data and three real-world applications (STAR, CASS, Lalonde NSW/PSID), demonstrating improved data fusion performance, interpretable tests for validity, and practical guidance for empirical causal inference. Overall, the work offers a principled, scalable toolkit for leveraging mixed data sources while diagnosing and adjusting for potential violations in causal identification.

Abstract

Experimental and observational studies often lack validity due to untestable assumptions. We propose a double machine learning approach to combine experimental and observational studies, allowing practitioners to test for assumption violations and estimate treatment effects consistently. Our framework proposes a falsification test for external validity and ignorability under milder assumptions. We provide consistent treatment effect estimators even when one of the assumptions is violated. However, our no-free-lunch theorem highlights the necessity of accurately identifying the violated assumption for consistent treatment effect estimation. Through comparative analyses, we show our framework's superiority over existing data fusion methods. The practical utility of our approach is further exemplified by three real-world case studies, underscoring its potential for widespread application in empirical research.
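To make the phrase "double machine learning" concrete, here is a minimal sketch of generic cross-fitting with a doubly robust (AIPW) score for an average treatment effect. It is not the paper's estimator for $\theta(t)$ or $\nu(t)$. The function names (`fit_line`, `cross_fit_aipw`, `simulate_trial`) are hypothetical, a one-covariate least-squares fit stands in for an arbitrary machine learning learner, and the constant propensity score assumes a randomized treatment.

```python
import random
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Closed-form least squares for y ~ a + b*x; returns a predictor function."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def cross_fit_aipw(x, t, y, n_folds=5, seed=0):
    """Cross-fitted AIPW estimate of the average treatment effect.

    Nuisance models are fit on the training folds and evaluated only on the
    held-out fold, so no unit's score uses a model trained on that unit.
    Returns (estimate, standard error).
    """
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    scores = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        treated = [i for i in train if t[i] == 1]
        control = [i for i in train if t[i] == 0]
        # Outcome regressions, one per treatment arm (stand-in for any ML learner).
        mu1 = fit_line([x[i] for i in treated], [y[i] for i in treated])
        mu0 = fit_line([x[i] for i in control], [y[i] for i in control])
        # Assumption: randomized treatment, so the propensity is a constant.
        e = len(treated) / len(train)
        for i in held_out:
            m1, m0 = mu1(x[i]), mu0(x[i])
            scores[i] = (
                m1 - m0
                + t[i] * (y[i] - m1) / e
                - (1 - t[i]) * (y[i] - m0) / (1 - e)
            )
    estimate = mean(scores)
    std_err = pstdev(scores) / n ** 0.5
    return estimate, std_err


def simulate_trial(n=2000, effect=2.0, seed=1):
    """Synthetic randomized data: y = 1 + 3x + effect*t + Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    t = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    y = [1.0 + 3.0 * xi + effect * ti + rng.gauss(0.0, 0.5) for xi, ti in zip(x, t)]
    return x, t, y


if __name__ == "__main__":
    x, t, y = simulate_trial()
    est, se = cross_fit_aipw(x, t, y)
    print(f"ATE estimate: {est:.3f} (SE {se:.3f})")
```

Keeping the nuisance fits and the score evaluation on disjoint folds is what lets flexible learners be plugged in without their overfitting biasing the final estimate. The estimators summarized above apply the same idea to scores derived from the paper's own efficient influence functions.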

Paper Structure (57 sections, 14 theorems, 88 equations, 11 figures, 2 tables)

Introduction
Literature Review
Preliminaries
Causal Inference: Notation
Causal Inference: Assumptions
Discussion of Assumptions
Impossibility of Double Resilience
Falsification Test for External Validity and Conditional Ignorability
An Estimator for Confounding
Treatment Effect Estimation
Efficiency Bound & Asymptotic Variance
Estimation under External Validity (A3)
Student Teacher Achievement Ratio (STAR) Project
Data Description
Analysis and Result
...and 42 more sections

Key Result

Theorem 2

There does not exist any doubly resilient estimator $g_{DR}(t, \mathbf{X})$.

Figures (11)

Figure 1: Causal DAGs. Panel (a): potential relationships without any assumptions. Panel (b): variable relationships under A1, A2, A3, and A5 while A4 is explicitly violated. In this case, treatment effect (TE) identification in the experimental sample can be understood as a special case of instrumental variable identification where $S$ is the instrument. Panel (c): variable relationships under A1, A2, A4, and A5 while A3 is explicitly violated. In this case, the relationship between $U$, $T$, and $Y$ is the same in both the experimental and observational populations.
Figure 2: Kernel Density Estimate of the distribution of $\hat{\theta}(1)$ and $\hat{\theta}(0)$. The majority of mass for each of the density functions is in the $>0$ region with minimal density around 0.
Figure 3: Confounding analysis to estimate the level of bias affecting the selection into small classrooms. Plausible estimates of the level of selection bias are all values of $\alpha$ for which the test fails to reject the null hypothesis. We find that the null hypothesis is not rejected for $3\leq \alpha \leq 29$, and the p-value $p(1)$ peaks at $\alpha=16$.
Figure 4: Kernel density plot for $\theta(0)$ and $\theta(1)$ for the CASS dataset.
Figure 5:
...and 6 more figures

Theorems & Definitions (16)

Definition 1: Double Resilience
Theorem 2: Doubly Resilient Estimators Do Not Exist
Theorem 3: Identification
Lemma 1
Theorem 4
Theorem 5: EIF under A3
Theorem 6
Corollary 7
Definition 1: Double Resilience
Lemma 2: Violated Assumptions Imply Selection Bias Unidentifiability
...and 6 more
