The Hardness of Validating Observational Studies with Experimental Data

Jake Fawkes, Michael O'Riordan, Athanasios Vlontzos, Oriol Corcoll, Ciarán Mark Gilligan-Lee

TL;DR

This work investigates the fundamental limits of validating observational causal estimates with experimental data. Using the framework of impossible inference, it argues that heterogeneous treatment effect estimates cannot be universally validated without additional smoothness assumptions. Experimental data can falsify biased observational CATE estimates, but it cannot guarantee their correction unless structure is imposed on the correction function. To address this, the authors introduce a Gaussian Process-based approach that models the correction term and a nuisance function within a multitask GP and uses pseudo-outcomes to produce uniform error bounds and credible intervals for the CATE across the observational support. Empirical results on simulated and semi-synthetic data show improved predictive performance and well-calibrated uncertainty relative to baselines, including when extrapolating beyond the experimental support. The approach offers practical inference tools for combining observational and experimental data while clarifying the role of smoothness assumptions in causal sensitivity analysis.

Abstract

Observational data is often readily available in large quantities, but can lead to biased causal effect estimates due to unobserved confounding. Recent works attempt to remove this bias by supplementing observational data with experimental data, which, when available, is typically smaller in scale because of the time and cost involved in running a randomised controlled trial. In this work, we prove a theorem that places fundamental limits on this "best of both worlds" approach. Using the framework of impossible inference, we show that although experimental data can be used to falsify causal effect estimates from observational data, in general it cannot be used to validate such estimates. Our theorem proves that while experimental data can detect bias in observational studies, it cannot remove that bias without additional assumptions on the smoothness of the correction function. We provide a practical example of such an assumption, developing a novel Gaussian Process-based approach to construct intervals that contain the true treatment effect with high probability, both inside and outside the support of the experimental data. We demonstrate our methodology on both simulated and semi-synthetic datasets and make the code available at https://github.com/Jakefawkes/Obs_and_exp_data.
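
The Gaussian Process idea in the abstract can be illustrated with a much simpler stand-in than the paper's multitask model. The sketch below is not the authors' implementation: it assumes a single-task GP with a fixed RBF kernel, a known experimental propensity of 0.5, and hypothetical synthetic functions (`true_cate`, `obs_cate_estimate`). It regresses the gap between an IPW pseudo-outcome and a biased observational CATE estimate on the covariate, then adds the fitted correction back, so the resulting intervals rest entirely on the smoothness encoded in the kernel.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.5, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var):
    """Posterior mean and standard deviation of a zero-mean GP regression."""
    k_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_ts)
    mean = k_ts.T @ alpha
    var = np.clip(np.diag(k_ss) - np.sum(v**2, axis=0), 0.0, None)
    return mean, np.sqrt(var)

def true_cate(x):
    """Hypothetical ground-truth CATE (assumption for illustration)."""
    return 1.0 + np.sin(2.0 * x)

def obs_cate_estimate(x):
    """Hypothetical observational estimate, off by a smooth bias of 0.5 * x."""
    return true_cate(x) - 0.5 * x

rng = np.random.default_rng(0)

# Small experiment supported on [0, 1], treatment randomised with probability 0.5.
n_exp = 400
x_exp = rng.uniform(0.0, 1.0, n_exp)
t_exp = rng.binomial(1, 0.5, n_exp)
y_exp = t_exp * true_cate(x_exp) + rng.normal(0.0, 0.3, n_exp)

# IPW pseudo-outcome: its conditional mean given x equals the CATE.
pseudo = (t_exp - 0.5) / (0.5 * 0.5) * y_exp

# The residual targets the correction: CATE minus the observational estimate.
residual = pseudo - obs_cate_estimate(x_exp)

# Evaluate on [0, 2]; points above 1 lie outside the experimental support.
x_grid = np.linspace(0.0, 2.0, 5)
mean, std = gp_posterior(x_exp, residual, x_grid, noise_var=np.var(residual))
corrected = obs_cate_estimate(x_grid) + mean
lower, upper = corrected - 1.96 * std, corrected + 1.96 * std
```

In this toy setting the posterior standard deviation grows as the grid moves away from the experimental support, which is the qualitative behaviour one would want from intervals that extrapolate. The kernel lengthscale and prior variance are fixed by hand here; in practice they would need to be chosen or learned, and the guarantees depend on that choice.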

Paper Structure

This paper contains 37 sections, 6 theorems, 52 equations, 4 figures, 13 tables.

Key Result

Theorem 3.4

Fix any $\underaccent{\bar{}}{f},\bar{f}: {\mathcal{X}} \to \mathbb{R}$ and let $\psi_n$ be an equivalence test with null ${\mathcal{Q}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$ and alternative ${\mathcal{P}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$. If the level of this test is $\alpha$, then $\mathbb{P}_P(\psi_n = 1) \leq \alpha$ for any $P \in {\mathcal{P}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$. That is, $\psi_n$ does not have power exceeding its level against any alternative.

Figures (4)

  • Figure 1: Causal Structure for generating the experimental and observational datasets. Dashed edges are only present in the observational dataset, whilst all others are present and fixed across both datasets.
  • Figure 2: Illustration of the sets of distributions and of the proof of the technical result in the section on the hardness of validating observational studies. Panel (a) shows the sets of distributions ${\mathcal{P}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$ and ${\mathcal{Q}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$. ${\mathcal{P}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$ is the set of distributions where $\Delta(\cdot)$ is always contained in the blue region, and ${\mathcal{Q}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$ is the set of all other distributions, that is, those where $\Delta(\cdot)$ leaves the blue region. To prove the hardness of validating observational study estimates, we show that for any $P \in {\mathcal{P}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$ we can find distributions $Q \in {\mathcal{Q}}_{M,\pi}(\underaccent{\bar{}}{f},\bar{f})$ that are arbitrarily close by adding spikes, as in panel (b).
  • Figure 3: A particularly pathological example of the behaviour we observe for each method in our simulated experiment. For the standard GP, hyperparameter optimisation leads to uninformative predictions, as it cannot account for close ${\mathbf x}$ values with seemingly no correlation. For the trained LCM, we get strong predictive performance but poor uncertainty quantification, especially out of distribution. Our approach gets the best of both scenarios, with strong predictive performance and calibrated uncertainty out of distribution.
  • Figure 4: Causal Structure for generating the experimental and observational datasets with environment node drawn in.
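
The spike argument described in the Figure 2 caption can be made concrete with a small numerical sketch. Everything here is an illustrative assumption rather than the paper's construction: the band $[-1, 1]$, the functions `smooth_correction` and `spiked_correction`, and the spike's width and height are all hypothetical. The point is only that a correction function staying inside a band and one leaving it through a very narrow spike can agree exactly at every sampled covariate value.

```python
import numpy as np

def smooth_correction(x):
    """Hypothetical correction function that stays inside the band [-1, 1]."""
    return 0.5 * np.sin(2.0 * np.pi * x)

def spiked_correction(x, centre=0.5, width=1e-9, height=10.0):
    """Same function plus a narrow triangular spike that leaves the band."""
    bump = np.clip(1.0 - np.abs(x - centre) / width, 0.0, None)
    return smooth_correction(x) + height * bump

rng = np.random.default_rng(0)
x_sample = rng.uniform(0.0, 1.0, size=1_000)

# Chance that at least one uniform draw lands in the spike's support,
# which has total length 2 * width = 2e-9.
hit_probability = 1.0 - (1.0 - 2e-9) ** len(x_sample)

# Do the two functions agree at every sampled point?
same_on_sample = np.array_equal(
    smooth_correction(x_sample), spiked_correction(x_sample)
)

# Size of the disagreement at the spike's centre.
centre = np.array([0.5])
gap_at_centre = (spiked_correction(centre) - smooth_correction(centre))[0]
```

Shrinking `width` drives `hit_probability` towards zero for any fixed sample size while the gap at the centre stays at `height`, which is why no finite sample can rule out the spiked alternative unless smoothness is assumed.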

Theorems & Definitions (17)

  • Definition 2.1: IPW Pseudo-Outcome
  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Theorem 3.4
  • Corollary 3.5
  • Proposition 3.6
  • Definition 3.7
  • Theorem 3.8
  • Proposition 4.1
  • ...and 7 more
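
The list above names the IPW pseudo-outcome of Definition 2.1 without stating it. One standard form is sketched below; the symbols $T$ (treatment), $Y$ (outcome), $Y(0), Y(1)$ (potential outcomes), $\tau$ (CATE), and the reading of $\pi$ as the experimental propensity score are assumptions, and the paper's own definition may differ in notation or detail. The short derivation shows why regressing such a pseudo-outcome on $X$ in randomised data targets the CATE.

$$
\begin{aligned}
\tilde{Y}_{\pi} &= \frac{T - \pi(X)}{\pi(X)\,\bigl(1 - \pi(X)\bigr)}\, Y
  = \frac{T\,Y}{\pi(X)} - \frac{(1 - T)\,Y}{1 - \pi(X)}, \\
\mathbb{E}\bigl[\tilde{Y}_{\pi} \mid X = x\bigr]
  &= \frac{\pi(x)\,\mathbb{E}[Y(1) \mid X = x]}{\pi(x)}
   - \frac{\bigl(1 - \pi(x)\bigr)\,\mathbb{E}[Y(0) \mid X = x]}{1 - \pi(x)} \\
  &= \mathbb{E}[Y(1) - Y(0) \mid X = x] = \tau(x).
\end{aligned}
$$

The second line uses randomisation, that is, $T$ is independent of $(Y(0), Y(1))$ given $X$ with $\mathbb{P}(T = 1 \mid X = x) = \pi(x)$, which holds in the experimental data but not necessarily in the observational data.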