Domain Generalization: A Tale of Two ERMs

Yilun Zhu, Naihao Deng, Naichen Shi, Aditya Gangrade, Clayton Scott

TL;DR

This work reframes domain generalization (DG) by distinguishing covariate shift from posterior drift and shows that domain-informed ERM (DI-ERM) can outperform pooling ERM when the conditional label distribution $P_{Y|X,D}$ drifts across domains. The authors formalize a general DG framework with joint distribution $P_{XYMD}$ and metadata $M$, derive Bayes-risk bounds, and establish a concrete gain bound under posterior drift: $\frac{\gamma\epsilon}{2}$. They prove a risk hierarchy $R^*_{\mathrm{pool}} \ge R^*_{\mathrm{DG}} \ge R^*_{\mathrm{full}}$ and demonstrate that domain information yields a provable advantage in posterior-drift scenarios, while under covariate shift the Bayes gain vanishes though restricted-function-class effects may yield empirical gains. Empirically, DI-ERM improves generalization across NLP annotator disagreement, reviewer-level bias in reviews, and image-style shifts, with larger gains for mid-sized models and diminishing returns for large models. The results underscore the practical value of leveraging domain metadata for robust DG and clarify when such information is essential versus incidental.

Abstract

Domain generalization (DG) is the problem of generalizing from several distributions (or domains), for which labeled training data are available, to a new test domain for which no labeled data is available. A common finding in the DG literature is that it is difficult to outperform empirical risk minimization (ERM) on the pooled training data. In this work, we argue that this finding has primarily been reported for datasets satisfying a covariate shift assumption. When the dataset satisfies a posterior drift assumption instead, we show that "domain-informed ERM," wherein feature vectors are augmented with domain-specific information, outperforms pooling ERM. These claims are supported by a theoretical framework and experiments on language and vision tasks.

Paper Structure

This paper contains 32 sections, 6 theorems, 62 equations, 5 figures, 9 tables.

Key Result

Proposition 1

Figures (5)

  • Figure 1: Illustration of \ref{thm:dg_improve}. Consider binary classification with $X \in \mathbb{R}$, $Y \in \{1, 2\}$, and $M \in \{1,2\}$. Then the Bayes classifiers $f^*_{\mathrm{pool}}(x)$, $f^*_{\mathrm{DG}}(x, m=1)$ and $f^*_{\mathrm{DG}}(x, m=2)$ can be obtained by thresholding the corresponding posteriors at $1/2$. The left figure shows a scenario where the domain-informed classifier $f^*_{\mathrm{DG}}$ and the pooled classifier $f^*_{\mathrm{pool}}$ agree everywhere, and therefore both the upper and lower bounds are $0$. In this case, domain information $M$ is not beneficial. The right figure shows a scenario where $f^*_{\mathrm{DG}}$ disagrees with $f^*_{\mathrm{pool}}$ in certain regions, and domain information does lead to lower Bayes risk.
  • Figure 2: Illustration of Example \ref{eg:covariate_shift_improve}, where $R^*_{\mathrm{pool},\, \mathcal{G}} > R^*_{\mathrm{DG},\, \mathcal{F}}$.
  • Figure 3: Text prompt that encodes annotator profile.
  • Figure 4: Text prompt that encodes reviewer writing style.
  • Figure 5: Example of style-specific text prompts used as domain descriptions.
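
The comparison described in the Figure 1 caption can be checked numerically on a small discrete problem. The sketch below is not taken from the paper: it assumes binary labels in $\{0, 1\}$, 0-1 loss, and two hypothetical toy distributions (`drift_*` and `shift_*`) chosen purely for illustration. It computes the Bayes risk of the best classifier that sees only $x$ (pooled) and of the best classifier that sees $(x, m)$ (domain-informed), in each case by thresholding the relevant posterior at $1/2$.

```python
# Illustrative sketch only; the distributions below are hypothetical.
# p_xm[x][m] = P(X=x, M=m);  eta[x][m] = P(Y=1 | X=x, M=m).

def bayes_risk_domain_informed(p_xm, eta):
    """Bayes risk (0-1 loss) of the best classifier that sees both x and m."""
    return sum(
        p_xm[x][m] * min(eta[x][m], 1 - eta[x][m])
        for x in range(len(p_xm))
        for m in range(len(p_xm[x]))
    )

def bayes_risk_pooled(p_xm, eta):
    """Bayes risk (0-1 loss) of the best classifier that sees only x."""
    risk = 0.0
    for x in range(len(p_xm)):
        p_x = sum(p_xm[x])                                   # P(X=x)
        p_y1_x = sum(p * e for p, e in zip(p_xm[x], eta[x]))  # P(Y=1, X=x)
        # Predicting the more likely label errs with the smaller joint mass.
        risk += min(p_y1_x, p_x - p_y1_x)
    return risk

# Posterior drift: same marginal of X in both domains, different P(Y | X, M).
drift_p = [[0.25, 0.25], [0.25, 0.25]]
drift_eta = [[0.9, 0.3], [0.2, 0.8]]

# Covariate shift: P(Y | X) is identical across domains, P(X | M) differs.
shift_p = [[0.4, 0.1], [0.1, 0.4]]
shift_eta = [[0.9, 0.9], [0.2, 0.2]]

for name, p, eta in [("posterior drift", drift_p, drift_eta),
                     ("covariate shift", shift_p, shift_eta)]:
    r_pool = bayes_risk_pooled(p, eta)
    r_dg = bayes_risk_domain_informed(p, eta)
    print(f"{name}: pooled={r_pool:.3f}, domain-informed={r_dg:.3f}")
```

In this toy setup the domain-informed risk is strictly lower under posterior drift, while the two risks coincide under covariate shift. This matches the qualitative picture in the caption, but it says nothing about the sizes of the gaps in the paper's own experiments.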

Theorems & Definitions (13)

  • Remark 1
  • Remark 2
  • Remark 3
  • Proposition 1: Risk Hierarchy
  • Definition 1: Point-wise Margin
  • Theorem 1: Risk Reduction w/ Domain Info
  • Definition 2: Posterior Drift Class for DG
  • Proposition 2: Gain Under Posterior Drift
  • Remark 4
  • Theorem 2: $R^*_{\mathrm{DG}}$ vs. $R^*_{\mathrm{full}}$
  • ...and 3 more