Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization

Jivat Neet Kaur, Emre Kiciman, Amit Sharma

TL;DR

The paper addresses domain generalization under realistic multi-attribute distribution shifts by grounding generalization in the data-generating process (DGP) through a canonical causal graph. It proves that no single fixed conditional independence constraint can generalize across all shift types, motivating an adaptive approach. The authors introduce Causally Adaptive Constraint Minimization (CACM), which derives environment- and relation-specific independence constraints via d-separation and enforces them with MMD-based regularization on learned representations. Empirical results across MNIST variants, small NORB, and Waterbirds show CACM achieving the highest unseen-domain and worst-group accuracy on multi-attribute shifts and robust performance on individual shifts, while incorrect constraints degrade performance. This work highlights the importance of modeling causal structure in the DGP for robust out-of-distribution generalization.
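
To make the "MMD-based regularization on learned representations" concrete, the following is a minimal sketch of a penalty that is small when representations are distributed alike across the values of an attribute. The RBF kernel, the fixed bandwidth, the biased estimator, and the function names (`rbf_kernel`, `mmd2`, `independence_penalty`) are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd2(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        rbf_kernel(a, a, bandwidth).mean()
        + rbf_kernel(b, b, bandwidth).mean()
        - 2.0 * rbf_kernel(a, b, bandwidth).mean()
    )


def independence_penalty(features, attribute, bandwidth=1.0):
    """Sum of pairwise squared MMDs between feature groups that share an
    attribute value. Assumption: a small value is used as a proxy for
    'features are independent of the attribute'."""
    values = np.unique(attribute)
    total = 0.0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            group_i = features[attribute == values[i]]
            group_j = features[attribute == values[j]]
            total += mmd2(group_i, group_j, bandwidth)
    return total
```

In a training loop, a term like this would be added to the classification loss with a weighting coefficient. A conditional constraint (for example, independence given the label) could be approximated by computing the same penalty separately within each label group and summing the results.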

Abstract

Recent empirical studies on domain generalization (DG) have shown that DG algorithms that perform well on some distribution shifts fail on others, and no state-of-the-art DG algorithm performs consistently well on all shifts. Moreover, real-world data often has multiple distribution shifts over different attributes; hence we introduce multi-attribute distribution shift datasets and find that the accuracy of existing DG algorithms falls even further. To explain these results, we provide a formal characterization of generalization under multi-attribute shifts using a canonical causal graph. Based on the relationship between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. As a result, we prove that any algorithm based on a single, fixed constraint cannot work well across all shifts, providing theoretical evidence for mixed empirical results on DG algorithms. Based on this insight, we develop Causally Adaptive Constraint Minimization (CACM), an algorithm that uses knowledge about the data-generating process to adaptively identify and apply the correct independence constraints for regularization. Results on fully synthetic, MNIST, small NORB, and Waterbirds datasets, covering binary and multi-valued attributes and labels, show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process.
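
The claim that each shift entails different independence constraints can be illustrated with a small d-separation check. The sketch below uses the standard ancestral moral graph criterion. The two toy graphs in the usage note, including the node names `Xc`, `X`, `A`, `Y` and the hidden confounder `C`, are simplified assumptions chosen for illustration and are not the paper's exact graphs.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `edges`.

    edges is a list of (parent, child) pairs. Method: restrict to the
    ancestral set of {x, y} and `given`, moralize (link co-parents, drop
    directions), delete `given`, then test whether x can still reach y.
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)

    keep = ancestors(parents, {x, y, *given})
    adj = {node: set() for node in keep}
    for child in keep:
        pars = sorted(parents.get(child, ()))
        for p in pars:
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(pars, 2):  # "marry" co-parents
            adj[p].add(q)
            adj[q].add(p)

    blocked = set(given)
    stack, seen = [x], {x}
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nxt in adj[node]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return True


# Toy graph 1 (assumption): the label causes the attribute.
causal = [("Xc", "Y"), ("Xc", "X"), ("A", "X"), ("Y", "A")]
# Toy graph 2 (assumption): a hidden confounder C drives both Y and A.
confounded = [("Xc", "Y"), ("Xc", "X"), ("A", "X"), ("C", "Y"), ("C", "A")]
```

In the first toy graph, `d_separated(causal, "Xc", "A", given=("Y",))` holds while the unconditional version does not. In the second, the situation is reversed, because conditioning on the collider `Y` opens a path between `Xc` and `A`. A regularizer fixed to either constraint would therefore be wrong for one of the two graphs, which is the intuition behind choosing the constraint adaptively.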

Paper Structure (51 sections, 12 theorems, 24 equations, 9 figures, 9 tables, 1 algorithm)


Key Result

Theorem 2.1

Consider a causal DAG $\mathcal{G}$ over $\langle \bm{X}_c, \bm{X}, \bm{A}, Y \rangle$ and a corresponding generated dataset $(\bm{x}_i, \bm{a}_i, y_i)_{i=1}^{n}$, where $\bm{X}_c$ is unobserved. Assume that graph $\mathcal{G}$ has the following property: $\bm{X}_c$ is defined as the set of all pare

Figures (9)

  • Figure 1: (a) Our multi-attribute distribution shift dataset Col+Rot-MNIST. We combine Colored MNIST (https://doi.org/10.48550/arxiv.1907.02893) and Rotated MNIST (7410650) to introduce distinct shifts over the Color and Rotation attributes. (b) The causal graph representing the data-generating process for Col+Rot-MNIST. Color has a correlation with $Y$ that changes across environments, while Rotation varies independently. (c) Comparison with DG algorithms optimizing for different constraints shows the superiority of Causally Adaptive Constraint Minimization (CACM) (full table in Section \ref{sec: expts}).
  • Figure 2: (a) Canonical causal graph for specifying multi-attribute distribution shifts; (b) canonical graph with $E$-$\bm{X}_c$ correlation. The anti-causal graph is shown in Suppl. \ref{app:additional_graphs}. Shaded nodes denote observed variables; since not all attributes may be observed, we draw them with a dotted boundary. Dashed lines denote correlation: between $\bm{X}_c$ and $E$, and between $Y$ and $\bm{A}_{\overline{ind}}$. The $E$-$\bm{X}_c$ correlation can be due to confounding, selection, or a causal relationship; all our results hold for any of these relationships (see Suppl. \ref{app:rebuttal_ObjE}). (c) Different mechanisms for the $Y$-$\bm{A}_{\overline{ind}}$ relationship that lead to Causal, Confounded and Selected shifts.
  • Figure 3: Causal graphs for distinct distribution shifts based on the $Y$-$\bm{A}$ relationship.
  • Figure 4: (a), (b) Train and (c) Test domains for MNIST.
  • Figure 5: (a), (b) Train and (c) Test domains for small NORB.
  • ...and 4 more figures

Theorems & Definitions (19)

  • Definition 2.1
  • Definition 2.2
  • Theorem 2.1
  • Proposition 3.1
  • Corollary 3.1
  • Theorem 3.1
  • Corollary 3.2
  • Theorem B.1
  • proof
  • Proposition B.1
  • ...and 9 more