Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets

Yuxin Wang, Maresa Schröder, Dennis Frauen, Jonas Schweisthal, Konstantin Hess, Stefan Feuerriegel

TL;DR

This work constructs valid confidence intervals (CIs) for the average treatment effect (ATE) when combining multiple observational datasets with differing confounding structures. It leverages prediction-powered inference (PPI), which couples a measure of fit computed on a large, potentially confounded dataset with a rectifier estimated on a smaller, less biased dataset, to shrink CI width while maintaining asymptotic validity. The method covers both the observational-only and the RCT+observational settings, with theoretical guarantees and experiments on synthetic and medical data that show faithful coverage and substantially narrower CIs than naïve baselines. Overall, the approach enables more precise and reliable uncertainty quantification for multi-source causal evidence in medical contexts, and it accommodates flexible modeling choices, including pre-trained predictors.

Abstract

Constructing confidence intervals (CIs) for the average treatment effect (ATE) from patient records is crucial to assess the effectiveness and safety of drugs. However, patient records typically come from different hospitals, which raises the question of how multiple observational datasets can be effectively combined for this purpose. In our paper, we propose a new method that estimates the ATE from multiple observational datasets and provides valid CIs. Our method makes few assumptions about the observational datasets and is thus widely applicable in medical practice. The key idea of our method is that we leverage prediction-powered inference and thereby essentially 'shrink' the CIs, so that we offer more precise uncertainty quantification than naïve approaches. We further prove the unbiasedness of our method and the validity of our CIs. We confirm our theoretical results through various numerical experiments. Finally, we provide an extension of our method for constructing CIs from combinations of experimental and observational datasets.
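
The 'shrinking' idea can be written as a generic prediction-powered decomposition. The block below is a sketch under stated assumptions rather than the paper's exact estimator: $\hat{\psi}_i$ is a hypothetical symbol for an unbiased per-sample ATE score on the small dataset $\mathcal{D}^1$ (for example, an AIPW pseudo-outcome), $\hat{\tau}_2$ is the CATE estimator fitted on the large dataset $\mathcal{D}^2$ with sample splitting, $n$ and $N$ are the sizes of $\mathcal{D}^1$ and $\mathcal{D}^2$, and $\hat{\sigma}^2_{\tau_2}$ and $\hat{\sigma}^2_{\Delta}$ are the sample variances of the summands in the first and second average.

$$
\hat{\tau}^{\mathrm{PP}}
= \underbrace{\frac{1}{N}\sum_{j \in \mathcal{D}^2} \hat{\tau}_2(x_j)}_{\text{measure of fit}}
+ \underbrace{\frac{1}{n}\sum_{i \in \mathcal{D}^1} \bigl(\hat{\psi}_i - \hat{\tau}_2(x_i)\bigr)}_{\text{rectifier } \hat{\Delta}_\tau},
\qquad
\mathrm{CI}_{1-\alpha}
= \hat{\tau}^{\mathrm{PP}} \pm z_{1-\alpha/2}\,
\sqrt{\frac{\hat{\sigma}^2_{\tau_2}}{N} + \frac{\hat{\sigma}^2_{\Delta}}{n}}
$$

When $\hat{\tau}_2$ explains most of the variation in $\hat{\psi}_i$, the rectifier summands have small variance, and the interval is narrower than one built from $\mathcal{D}^1$ alone, even if $\hat{\tau}_2$ is biased.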

Paper Structure

This paper contains 36 sections, 3 theorems, 52 equations, 14 figures, 2 tables, 2 algorithms.

Key Result

Lemma 4.1

Let $\textcolor{NavyBlue}{\mathcal{D}^1}$ and $\textcolor{ForestGreen}{\mathcal{D}^2}$ be sampled i.i.d. under the assumptions above. Assume that we have fitted the CATE estimator $\textcolor{ForestGreen}{\hat{\tau}_2(x)}$ with sample splitting on $\textcolor{ForestGreen}{\mathcal{D}^2}$, and have con

Figures (14)

  • Figure 1: Key works aimed at ATE estimation from multiple datasets.
  • Figure 2: Overview of our method. To construct CIs for the ATE with two observational datasets from the same population but with different assumptions, we leverage prediction-powered inference: we decompose our task into computing a measure of fit (i.e., estimating the ATE on the large dataset $\textcolor{ForestGreen}{\mathcal{D}^2}$ via the DR-learner, given by $\textcolor{ForestGreen}{\hat{\tau}_2(x)}$) and a rectifier $\hat{\Delta}_\tau$ (i.e., a term that measures the difference in ATE estimates between the datasets $\textcolor{NavyBlue}{\mathcal{D}^1}$ and $\textcolor{ForestGreen}{\mathcal{D}^2}$). However, finding a rectifier for our task is non-trivial and requires a careful derivation in order to ensure asymptotically valid CIs ($\rightarrow$ our Theorem 4.2).
  • Figure 3: Performance for synthetic data. Left: We show the estimated CIs for five random seeds. The red line is the oracle ATE. Ideally, the CIs should be narrow but still cover the oracle ATE. Right: We show the width of the CIs averaged over five different seeds ($\alpha = 0.05$). Here, we vary the sizes of the datasets, given by $n$ ($\textcolor{NavyBlue}{\mathcal{D}^1}$) and $N$ ($\textcolor{ForestGreen}{\mathcal{D}^2}$). Note that $\hat{\tau}^{\textrm{AIPW}}$ ($\textcolor{ForestGreen}{\mathcal{D}^2}$ only) is intentionally shown in gray: it is not faithful, as seen in the left plot, and is therefore not a valid baseline. $\Rightarrow$ Our method yields faithful CIs that are also narrower, as desired.
  • Figure 4: Performance for synthetic data. Left: We show the estimated CIs for five different seeds in the RCT+observational setting. Right: We show the width of the CIs averaged over five different seeds ($\alpha = 0.05$). $\Rightarrow$ Our method is stable and leads to CIs that are faithful and narrow, as desired.
  • Figure 5: Results for MLP as regression method.
  • ...and 9 more figures
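
The measure-of-fit and rectifier decomposition described for Figure 2 can be illustrated with a small, standard-library-only Python sketch. It is not the authors' implementation. The function names (`ppi_ate_ci`, `naive_ci`, `simulate`) are hypothetical, and the toy data generator stands in for real nuisance estimation. The sketch assumes that the CATE predictor was fitted on data independent of the samples it is evaluated on, and that unbiased per-sample ATE scores on the small dataset (for example, AIPW pseudo-outcomes) are already available.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def ppi_ate_ci(tau2_on_d2, scores_on_d1, tau2_on_d1, alpha=0.05):
    """Generic prediction-powered CI for the ATE (illustrative sketch).

    tau2_on_d2   : CATE predictions on the N samples of the large dataset D2
    scores_on_d1 : unbiased per-sample ATE scores on the n samples of D1
                   (assumed given, e.g. AIPW pseudo-outcomes)
    tau2_on_d1   : the same CATE predictor evaluated on the n samples of D1
    Returns (lower, estimate, upper).
    """
    N, n = len(tau2_on_d2), len(scores_on_d1)
    measure_of_fit = fmean(tau2_on_d2)                 # average prediction on D2
    diffs = [s - t for s, t in zip(scores_on_d1, tau2_on_d1)]
    rectifier = fmean(diffs)                           # corrects the predictor's bias
    estimate = measure_of_fit + rectifier
    # The two averages use independent samples, so their variances add.
    se = math.sqrt(variance(tau2_on_d2) / N + variance(diffs) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate, estimate + z * se


def naive_ci(scores, alpha=0.05):
    """Baseline: normal-approximation CI from the small dataset alone."""
    est = fmean(scores)
    se = math.sqrt(variance(scores) / len(scores))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return est - z * se, est, est + z * se


def simulate(seed=0, n=200, N=20_000):
    """Toy data (an assumption for illustration): true ATE is 2.0."""
    rng = random.Random(seed)

    def true_cate(x):
        return 2.0 + 3.0 * x       # E[x] = 0, so the ATE equals 2.0

    def tau2_hat(x):
        return 2.5 + 3.0 * x       # predictor with a constant bias of +0.5

    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(N)]
    scores = [true_cate(x) + rng.gauss(0, 1) for x in x1]  # noisy, unbiased scores
    ppi = ppi_ate_ci([tau2_hat(x) for x in x2], scores, [tau2_hat(x) for x in x1])
    return ppi, naive_ci(scores)


if __name__ == "__main__":
    ppi, naive = simulate()
    print("PPI   CI width:", ppi[2] - ppi[0])
    print("naive CI width:", naive[2] - naive[0])
```

In this toy setup the predictor is biased, yet the rectifier removes the bias on average, and the variance of the rectifier terms is much smaller than the variance of the raw scores. This is why the prediction-powered interval comes out narrower than the interval built from the small dataset alone.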

Theorems & Definitions (8)

  • Lemma 4.1: follows from Wager (2024)
  • proof
  • Theorem 4.2: Validity of our prediction-powered CIs
  • proof
  • Remark 5.1
  • Theorem 5.2: Validity of our prediction-powered CIs in RCT+observational setting
  • proof
  • proof