
Unsupervised domain adaptation under hidden confounding

Carlos García Meixide, David Ríos Insua

TL;DR

This work tackles prediction under distribution shift with hidden confounding across multiple data sources by introducing Generative Invariance (GI), a framework that jointly exploits invariance for identifiability and a target-domain generative model to replicate the test environment. GI yields estimators (β̂, K̂) with closed-form expressions and strong theoretical properties, including concentration and asymptotic normality, enabling optimal predictions without worst-case perturbation assumptions. The authors derive identifiability results for both univariate and multivariate feature settings, provide empirical GI via plug-in estimators, and demonstrate superior predictive performance and distributional alignment through extensive simulations and a cardiovascular medicine dataset (SPRINT). The approach offers practical benefits for unsupervised domain adaptation under hidden confounding, with implications for robust predictive modeling in healthcare and beyond, and suggests avenues for high-dimensional and nonlinear extensions as well as Bayesian interpretations.

Abstract

We introduce a new predictive mechanism that operates in the presence of hidden confounding across distributionally diverse data sources while ensuring consistent estimation of causal parameters, despite their recognized suboptimality for prediction in the literature. Our method is based on a novel estimand that captures the dependence structure between response noise and covariates, incorporating causal parameters into a generative model that adaptively replicates the conditional distribution of the test environment. Identifiability is achieved under a straightforward, empirically verifiable assumption. Our approach ensures probabilistic alignment with test distributions uniformly across arbitrary interventions, enabling valid predictions without requiring worst-case optimization or assumptions about the strength of perturbations at test time. Through extensive simulations, we demonstrate that our method outperforms state-of-the-art invariance-based and domain adaptation approaches. Additionally, we validate its practical applicability and superior target risk performance on a cardiovascular disease dataset.

Paper Structure

This paper contains 30 sections, 10 theorems, 86 equations, 7 figures, 2 tables, and 1 algorithm.

Key Result

theorem 1

If $\mathbb{E}X_1 \neq 0$, then $(\beta_{opt}, K_{opt})$ exists and is unique, with $\beta_{opt} = \beta_*$ and $K_{opt} \operatorname{Var} X_1 = K_*$.

Figures (7)

  • Figure 1: (Simpson's paradox) Although IV accurately identifies the causal parameter (positive slope), it proves entirely suboptimal for making predictions in the test environment. Our predictions, in red, align perfectly with the test source.
  • Figure 2: Sample from the training environment $\mathbb{P}^1$ in blue. Samples from 5 different test environments in black. An ordinary linear model fitted to the blue data produces the predictions shown in red. Our predictions, coloured in green, match the test distributions.
  • Figure 3: GI applied to SPRINT trial data. Left: data from hospital 48 as black dots and from hospital 22 as white squares. Right: the predictions given by our estimator (in blue) for the new unseen hospital 15. Our approach does not see the blue triangles, only the blue values on the horizontal axis.
  • Figure 4: Test MSEs for varying test perturbation strengths. Note the log scale on both axes, especially the vertical one. The MSE of GI's predictions is lower for every value of DRIG's tuning hyperparameter.
  • Figure 5: Performance of three generative models in emulating a test distribution indexed by the horizontal axis. Energy distance between GI's predictions and true test labels is uniformly almost zero.
  • ...and 2 more figures
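
Figure 5 compares generative models by the energy distance between predictions and true test labels. As a rough illustration of that metric only, the sketch below computes the standard squared energy distance, $2\,\mathbb{E}|X-Y| - \mathbb{E}|X-X'| - \mathbb{E}|Y-Y'|$, for one-dimensional samples. The function name `energy_distance`, the synthetic Gaussian data, and the choice of the V-statistic form are assumptions made for this example; the paper's own estimator and data are not reproduced here.

```python
import numpy as np


def energy_distance(x, y):
    """Squared energy distance between two 1-D samples (V-statistic form).

    Hypothetical helper for illustration; not taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Mean absolute difference between the two samples.
    cross = np.abs(x[:, None] - y[None, :]).mean()
    # Mean absolute difference within each sample (diagonal zeros included).
    within_x = np.abs(x[:, None] - x[None, :]).mean()
    within_y = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * cross - within_x - within_y


# Synthetic stand-ins for test labels and two sets of predictions.
rng = np.random.default_rng(0)
test_labels = rng.normal(loc=1.0, scale=1.0, size=500)
aligned_preds = rng.normal(loc=1.0, scale=1.0, size=500)  # same distribution
shifted_preds = rng.normal(loc=3.0, scale=1.0, size=500)  # mean shifted by 2

d_aligned = energy_distance(aligned_preds, test_labels)
d_shifted = energy_distance(shifted_preds, test_labels)
print(f"aligned: {d_aligned:.4f}  shifted: {d_shifted:.4f}")
```

A value near zero indicates that the two samples are distributed alike, which is how the caption's "almost zero" should be read; the shifted predictions give a clearly larger value.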

Theorems & Definitions (22)

  • remark 1
  • theorem 1
  • remark 2
  • theorem 2
  • theorem 3: Properties of $H_Z$ and $M_Z$
  • proof
  • remark 3
  • theorem 4: GI closed-form estimators
  • theorem 5: Finite-sample bound
  • theorem 6: Asymptotic normality of $\hat{\beta}$
  • ...and 12 more