Domain Generalization: A Tale of Two ERMs
Yilun Zhu, Naihao Deng, Naichen Shi, Aditya Gangrade, Clayton Scott
TL;DR
This work reframes domain generalization (DG) by distinguishing covariate shift from posterior drift, and shows that domain-informed ERM (DI-ERM) can outperform pooling ERM when the conditional label distribution $P_{Y|X,D}$ drifts across domains. The authors formalize a general DG framework with joint distribution $P_{XYMD}$, where $M$ denotes metadata, derive Bayes-risk bounds, and establish a concrete gain bound of $\frac{\gamma\epsilon}{2}$ under posterior drift. They prove the risk hierarchy $R^*_{pool} \ge R^*_{DG} \ge R^*_{full}$ and show that domain information yields a provable advantage in posterior-drift scenarios. Under covariate shift the Bayes gain vanishes, although restricting the function class may still yield empirical gains. Empirically, DI-ERM improves generalization on tasks involving annotator disagreement in NLP, reviewer-level bias in reviews, and image-style shifts, with larger gains for mid-sized models and diminishing returns for large models. The results underscore the practical value of domain metadata for robust DG and clarify when such information is essential and when it is merely incidental.
Abstract
Domain generalization (DG) is the problem of generalizing from several distributions (or domains), for which labeled training data are available, to a new test domain for which no labeled data are available. A common finding in the DG literature is that it is difficult to outperform empirical risk minimization (ERM) on the pooled training data. In this work, we argue that this finding has primarily been reported for datasets satisfying a \emph{covariate shift} assumption. When the dataset satisfies a \emph{posterior drift} assumption instead, we show that ``domain-informed ERM,'' wherein feature vectors are augmented with domain-specific information, outperforms pooling ERM. These claims are supported by a theoretical framework and experiments on language and vision tasks.
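The contrast between pooling ERM and domain-informed ERM can be illustrated with a small toy sketch. It is not the authors' experimental setup: the data, the binary domain descriptor `m`, and the helper names `make_domain`, `fit_majority`, and `accuracy` are all assumptions made for illustration. Each domain carries a descriptor `m` that flips the label rule, which is an extreme form of posterior drift, and ERM over all classifiers on a finite input space reduces to a majority vote per input cell.

```python
from collections import Counter, defaultdict

def make_domain(m, n=50):
    """Return (x, m, y) triples for one hypothetical domain with descriptor m."""
    rows = []
    for i in range(n):
        x = i % 2
        y = x if m == 0 else 1 - x  # label rule depends on m: posterior drift
        rows.append((x, m, y))
    return rows

def fit_majority(rows, key):
    """ERM under 0-1 loss over all classifiers on a finite input space:
    predict the majority training label within each input cell."""
    votes = defaultdict(Counter)
    for x, m, y in rows:
        votes[key(x, m)][y] += 1
    table = {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}
    return lambda x, m: table[key(x, m)]

def accuracy(predict, rows):
    """Fraction of rows whose label is predicted correctly."""
    return sum(predict(x, m) == y for x, m, y in rows) / len(rows)

# Four training domains with alternating descriptors, two unseen test domains.
train = [row for d in range(4) for row in make_domain(m=d % 2)]
test = [row for m in (0, 1) for row in make_domain(m=m)]

# Pooling ERM sees only x; domain-informed ERM sees x augmented with m.
pooled = fit_majority(train, key=lambda x, m: x)
informed = fit_majority(train, key=lambda x, m: (x, m))

pooled_acc = accuracy(pooled, test)
informed_acc = accuracy(informed, test)
print(f"pooling ERM accuracy:         {pooled_acc:.2f}")
print(f"domain-informed ERM accuracy: {informed_acc:.2f}")
```

In this construction any classifier that depends on `x` alone is right on exactly half of the test points, whereas the classifier that also sees `m` recovers the label rule of each unseen domain. Real posterior drift is rarely this clean, so the size of the gap here should not be read as indicative of the paper's results.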
