Unsupervised domain adaptation under hidden confounding
Carlos García Meixide, David Ríos Insua
TL;DR
This work tackles prediction under distribution shift with hidden confounding across multiple data sources by introducing Generative Invariance (GI), a framework that jointly exploits invariance for identifiability and a target-domain generative model to replicate the test environment. GI yields estimators (β̂, K̂) with closed-form expressions and strong theoretical properties, including concentration and asymptotic normality, enabling optimal predictions without worst-case perturbation assumptions. The authors derive identifiability results for both univariate and multivariate feature settings, provide empirical GI via plug-in estimators, and demonstrate superior predictive performance and distributional alignment through extensive simulations and a cardiovascular medicine dataset (SPRINT). The approach offers practical benefits for unsupervised domain adaptation under hidden confounding, with implications for robust predictive modeling in healthcare and beyond, and suggests avenues for high-dimensional and nonlinear extensions as well as Bayesian interpretations.
Abstract
We introduce a new predictive mechanism that operates in the presence of hidden confounding across distributionally diverse data sources while ensuring consistent estimation of causal parameters, despite their recognized suboptimality for prediction in the literature. Our method is based on a novel estimand that captures the dependence structure between response noise and covariates, incorporating causal parameters into a generative model that adaptively replicates the conditional distribution of the test environment. Identifiability is achieved under a straightforward, empirically verifiable assumption. Our approach ensures probabilistic alignment with test distributions uniformly across arbitrary interventions, enabling valid predictions without requiring worst-case optimization or assumptions about the strength of perturbations at test time. Through extensive simulations, we demonstrate that our method outperforms state-of-the-art invariance-based and domain adaptation approaches. Additionally, we validate its practical applicability and superior target risk performance on a cardiovascular disease dataset.
