Regression-Based Estimation of Causal Effects in the Presence of Selection Bias and Confounding

Marlies Hafer, Alexander Marx

TL;DR

The paper tackles estimating the causal effect $E[Y\mid do(X)]$ for continuous outcomes when data suffer from selection bias and confounding. It builds a formal DAG-based framework and derives identifiability and $s$-recoverability results by leveraging proxy variables and external data, even in the presence of unobserved confounding. The authors introduce the Two-Step Regression (TSR) estimator, which combines a regression on observed, biased data with a second stage using external data to adjust for selection while accounting for confounding; TSR reduces variance relative to prior regression-based approaches and extends naturally to non-linear settings via feature maps. Through extensive simulations and multiple DAG-based experiments, TSR is shown to correctly recover causal effects under selection bias and confounding, with empirical evidence of improved variance properties and practical viability in scenarios such as loan-default risk assessment and other domains where costly labeling limits unbiased data collection.

Abstract

We consider the problem of estimating the expected causal effect $E[Y|do(X)]$ for a target variable $Y$ when treatment $X$ is set by intervention, focusing on continuous random variables. In settings without selection bias or confounding, $E[Y|do(X)] = E[Y|X]$, which can be estimated using standard regression methods. However, regression fails when systematic missingness induced by selection bias, or confounding, distorts the data. Boeken et al. [2023] show that when training data is subject to selection, proxy variables unaffected by this process can, under certain constraints, be used to correct for selection bias to estimate $E[Y|X]$, and hence $E[Y|do(X)]$, reliably. When data is additionally affected by confounding, however, this equality is no longer valid. Building on these results, we consider a more general setting and propose a framework that incorporates both selection bias and confounding. Specifically, we derive theoretical conditions ensuring identifiability and recoverability of causal effects under access to external data and proxy variables. We further introduce a two-step regression estimator (TSR), capable of exploiting proxy variables to adjust for selection bias while accounting for confounding. We show that TSR coincides with prior work if confounding is absent, but achieves a lower variance. Extensive simulation studies validate TSR's correctness for scenarios that may include both selection bias and confounding with proxy variables.
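
The selection-bias correction described above can be illustrated with a toy simulation. The sketch below is not the authors' TSR implementation; it is a minimal two-stage regression under assumptions made up for illustration: a linear Gaussian model, no confounding (so $E[Y|do(X)] = E[Y|X]$), a single proxy $Z$, and selection that depends only on $Z$, so that $Y$ is independent of selection given $(X, Z)$. The coefficients, sample sizes, and the helper names `simulate` and `ols` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n):
    """Toy linear model (an assumption for illustration): E[Y|X] = 4 * X."""
    x = rng.normal(size=n)
    z = 2.0 * x + rng.normal(scale=2.0, size=n)   # proxy variable Z
    y = 1.0 * x + 1.5 * z + rng.normal(size=n)    # outcome Y
    return x, z, y


def ols(features, target):
    """Least-squares fit with intercept; returns [intercept, coef_1, ...]."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Biased sample: a unit is observed only if Z > 0 (selection depends on Z alone).
x, z, y = simulate(20_000)
selected = z > 0
xs, zs, ys = x[selected], z[selected], y[selected]

# External sample without selection; only (X, Z) are used, Y is discarded.
x_ext, z_ext, _ = simulate(20_000)

# Naive regression of Y on X in the selected sample (biased by selection).
naive_slope = ols([xs], ys)[1]

# Stage 1: regress Y on (X, Z) in the selected sample.
b0, b_x, b_z = ols([xs, zs], ys)

# Stage 2: regress Z on X in the external, unselected sample.
c0, c_x = ols([x_ext], z_ext)

# Combine: E[Y|X] = b0 + b_x * X + b_z * E[Z|X].
corrected_intercept = b0 + b_z * c0
corrected_slope = b_x + b_z * c_x

print("true slope:      4.0")
print("naive slope:    ", naive_slope)
print("corrected slope:", corrected_slope)
```

In this toy setting the first stage is unaffected by selection because selection depends on $Z$ only, and the second stage needs no outcome labels, which matches the motivating case where unbiased labels are costly to collect. Handling confounding, as the paper's estimator does, is outside the scope of this sketch.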

Paper Structure

This paper contains 30 sections, 9 theorems, 48 equations, 99 figures, 8 tables.

Key Result

Theorem 3.5

Under Assumption ass:new, the causal effect $E[Y\mid do(X)]$ is identifiable and $s$-recoverable, and can be expressed as follows

Figures (99)

  • Figure 3: Quadratic model. Left: boxplots of the MSE over $\mathcal{D}$ of RR and TSR for $n\in\{500,1000,5000\}$, with $\mathcal{S}\subset\mathcal{D}$ (top-left) and $\mathcal{S}\cap\mathcal{D}=\emptyset$ (bottom-left). The plots on the right show the associated $95\%$-areas of naive, RR and TSR for $n=500$. The upper boxplot in these figures represents the distribution of $X$ in $\mathcal{S}$ and the lower in $\mathcal{D}$. The curves for RR and TSR display the mean estimation over the simulation runs.
  • Figure 6: Comparison of the central $95\%$-areas of RR and TSR of the simulation runs for the DAG in Figure \ref{fig:vierDAGs} (d) with sample size $n=500$ in the setting with $\mathcal{S}\cap\mathcal{D}=\emptyset$. The upper boxplot represents the distribution of $X$ in $\mathcal{S}$ and the lower in $\mathcal{D}$. The curves for RR and TSR display the mean estimation over the simulation runs.
  • Figure 17: Comparison of the central $95\%$-areas of RR and TSR of the simulation runs for the DAG in Figure \ref{fig:vierDAGs} (d) with sample size $n=2000$. The upper boxplot represents the distribution of $X$ in $\mathcal{S}$ and the lower in $\mathcal{D}$ ($\mathcal{S}\cap\mathcal{D}=\emptyset$). The curves for RR and TSR display the mean estimation over the simulation runs.
  • Figure : $\mathcal{S}$
  • Figure : (a)
  • ...and 94 more figures

Theorems & Definitions (19)

  • Definition 3.1: s-recoverability
  • Definition 3.2: s-recoverability with external data
  • Theorem 3.5
  • proof
  • Theorem 3.7
  • Corollary 3.8
  • proof
  • Theorem 3.10
  • Definition A.1: rules of do-calculus
  • Theorem A.3: Selection backdoor adjustment [Bareinboim et al., 2014]
  • ...and 9 more