
Anti-causal domain generalization: Leveraging unlabeled data

Sorawit Saengkyongam, Juan L. Gamella, Andrew C. Miller, Jonas Peters, Nicolai Meinshausen, Christina Heinze-Deml

TL;DR

This work tackles domain generalization under an anti-causal regime where the outcome $Y$ drives the observed covariates $X$, allowing unlabeled data from multiple environments to inform how distributions shift. It introduces two regularizers, Mean-based Invariant Regularization (MIR) and Variance-based Invariant Regularization (VIR), which penalize sensitivity to mean and covariance shifts estimated from unlabeled data, establishing worst-case robustness guarantees for linear predictors and extending naturally to nonlinear representations. The authors provide population and plug-in estimators, prove consistency, and demonstrate empirical gains on a controlled Light Tunnel and the VitalDB stroke-volume dataset, especially when labeled environments are scarce. These methods enable robust performance under distribution shifts without requiring outcome labels across many environments, with practical impact in safety-critical settings like healthcare and physics-inspired sensing. The framework supports extensions to alternative losses and nonlinear models via representation learning, broadening applicability to high-dimensional, unstructured data.

Abstract

The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we study domain generalization in an anti-causal setting, where the outcome causes the observed covariates. Under this structure, environment perturbations that affect the covariates do not propagate to the outcome, which motivates regularizing the model's sensitivity to these perturbations. Crucially, estimating these perturbation directions does not require labels, enabling us to leverage unlabeled data from multiple environments. We propose two methods that penalize the model's sensitivity to variations in the mean and covariance of the covariates across environments, respectively, and prove that these methods have worst-case optimality guarantees under certain classes of environments. Finally, we demonstrate the empirical performance of our approach on a controlled physical system and a physiological signal dataset.
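The mean-based idea in the abstract can be illustrated with a small linear sketch. It is not the authors' implementation. It assumes the penalty is a quadratic form $\gamma\,\beta^\top K \beta$, where $K$ is the covariance of the per-environment covariate means computed from unlabeled data. The function names (`env_mean_covariance`, `fit_penalized`, `simulate`), the simulated data, and the value of `gamma` are all hypothetical.

```python
import numpy as np

def env_mean_covariance(unlabeled_envs):
    """Covariance of per-environment covariate means; no labels are used."""
    means = np.stack([X_e.mean(axis=0) for X_e in unlabeled_envs])
    centered = means - means.mean(axis=0)
    return centered.T @ centered / len(unlabeled_envs)

def fit_penalized(X, y, K, gamma):
    """Closed-form minimizer of mean((y - X b)^2) + gamma * b^T K b (no intercept)."""
    n = X.shape[0]
    return np.linalg.solve(X.T @ X / n + gamma * K, X.T @ y / n)

def simulate(rng, n, shift):
    """Toy anti-causal data (assumption): Y causes X, and the environment
    shifts only the mean of the first covariate."""
    y = rng.normal(size=n)
    noise = 0.5 * rng.normal(size=(n, 2))
    X = np.outer(y, [1.0, 1.0]) + shift * np.array([1.0, 0.0]) + noise
    return X, y

rng = np.random.default_rng(0)

# Labels are available in a single environment only.
X_lab, y_lab = simulate(rng, 500, shift=0.0)

# Covariates (without labels) are available from several environments.
unlabeled = [simulate(rng, 500, shift=s)[0] for s in (-2.0, -1.0, 0.0, 1.0, 2.0)]
K = env_mean_covariance(unlabeled)

beta_ols = fit_penalized(X_lab, y_lab, K, gamma=0.0)   # plain least squares
beta_pen = fit_penalized(X_lab, y_lab, K, gamma=10.0)  # mean-shift penalty

# Evaluate in an unseen environment with a larger mean shift.
X_test, y_test = simulate(rng, 500, shift=5.0)
mse_ols = np.mean((y_test - X_test @ beta_ols) ** 2)
mse_pen = np.mean((y_test - X_test @ beta_pen) ** 2)
print("OLS coefficients:      ", beta_ols, "test MSE:", mse_ols)
print("Penalized coefficients:", beta_pen, "test MSE:", mse_pen)
```

In this toy setup the penalty shrinks the coefficient on the covariate whose mean moves across environments, so the prediction changes little when the test environment shifts that mean further. Setting `gamma=0.0` recovers ordinary least squares.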

Paper Structure (60 sections, 5 theorems, 77 equations, 5 figures)

This paper contains 60 sections, 5 theorems, 77 equations, 5 figures.

Key Result

Theorem 4.1

Define … Under the anti-causal setting, we have …

Figures (5)

  • Figure 1: Illustration of MIR. Blue points represent the training environment perturbation means $\mathbb{E}[\varepsilon_{e_i}]$, and the green star represents the test environment mean. The red ellipse represents the covariance structure of the regularization matrix $\mathrm{Var}(K)$, with red arrows indicating its (scaled) eigenvectors. The annotated arrows show the directions of the OLS and MIR solutions with different regularization strengths $\gamma$, as well as the optimal solution $\beta^{*}_{\mathrm{test}}$ for the test environment $e_{\mathrm{tst}}$.
  • Figure 2: Diagram of the light tunnel and the subset of variables used in our experiment: the outcomes or intervention targets (red, green and blue) and the light-intensity measurements (ir_1, vis_1, …, vis_3) used as predictors. Figure adapted from Gamella et al. (2025), licensed under CC BY 4.0.
  • Figure 3: Performance on the Light Tunnel dataset. We show average RMSE (with standard errors) for leave-one-environment-out cross-validation across all outcome-intervention combinations. The x-axis indicates the number of environments with labeled observations; MIR uses unlabeled data from all training environments regardless of this number.
  • Figure 4: Performance on the VitalDB dataset when all 128 training subjects have labeled data. (a) CVaR (average nMSE over subjects whose errors are above a given quantile) as a function of the quantile threshold. (b) Distribution of per-subject Spearman correlations between predicted and true stroke volume variations. VIR achieves improved robustness, especially on worse-performing subjects, while maintaining tracking performance comparable to the baselines.
  • Figure 5: Performance on the VitalDB dataset as a function of the number of labeled environments. VIR improves over the baselines across all settings, and the improvement is most pronounced when few labeled environments are available.
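
The CVaR metric in the Figure 4 caption (the average error over subjects whose errors lie above a given quantile) can be sketched as follows. This is a minimal illustration only: the function name `cvar`, the empirical-quantile convention (sort, then average the upper tail), and the example error values are assumptions, not the paper's exact evaluation code.

```python
def cvar(errors, quantile):
    """Average of the per-subject errors in the upper tail.

    `errors` is a non-empty sequence of per-subject errors (e.g. nMSE);
    `quantile` is in [0, 1). The tail starts at the empirical quantile,
    an assumed convention for this sketch.
    """
    if not errors:
        raise ValueError("errors must be non-empty")
    if not 0.0 <= quantile < 1.0:
        raise ValueError("quantile must be in [0, 1)")
    ordered = sorted(errors)
    start = min(int(quantile * len(ordered)), len(ordered) - 1)
    tail = ordered[start:]
    return sum(tail) / len(tail)

# Hypothetical per-subject errors, for illustration only.
subject_errors = [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 0.5, 0.6, 0.7, 0.8]
for q in (0.0, 0.5, 0.9):
    print(q, cvar(subject_errors, q))
```

At quantile 0 this reduces to the plain average over all subjects. Larger quantiles focus on the worst-performing subjects, which is where the caption reports the largest gains for VIR.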

Theorems & Definitions (12)

  • Remark 3.2
  • Theorem 4.1: MIR Robustness
  • Theorem 4.2: VIR Robustness
  • Remark 4.3
  • Proposition 5.1: Consistency
  • Lemma 1.1: MSE Decomposition
  • proof
  • proof
  • Lemma 1.2
  • proof
  • ...and 2 more