Understanding Domain Generalization: A Noise Robustness Perspective

Rui Qiao, Bryan Kian Hsiang Low

TL;DR

The paper investigates domain generalization (DG) under label noise, asking when DG methods outperform empirical risk minimization (ERM). It combines finite-sample theory with empirical evaluation to show that label noise can push ERM toward spurious correlations, while some DG approaches exhibit implicit noise robustness through their optimization dynamics, particularly in synthetic settings. However, results on real-world benchmarks reveal no consistent advantage of DG over ERM, suggesting that any gains depend on dataset characteristics, pretraining, and augmentation strategies. The work clarifies the gap between theoretical noise robustness and practical generalization, and argues for further theoretical development and benchmarking to determine when invariant learning truly improves out-of-distribution performance.

Abstract

Despite the rapid development of machine learning algorithms for domain generalization (DG), there is no clear empirical evidence that the existing DG algorithms outperform the classic empirical risk minimization (ERM) across standard benchmarks. To better understand this phenomenon, we investigate whether there are benefits of DG algorithms over ERM through the lens of label noise. Specifically, our finite-sample analysis reveals that label noise exacerbates the effect of spurious correlations for ERM, undermining generalization. Conversely, we illustrate that DG algorithms exhibit implicit label-noise robustness during finite-sample training even when spurious correlation is present. This desirable property helps mitigate spurious correlations and improve generalization in synthetic experiments. However, additional comprehensive experiments on real-world benchmark datasets indicate that label-noise robustness does not necessarily translate to better performance compared to ERM. We conjecture that the failure mode of ERM arising from spurious correlations may be less pronounced in practice.
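
To make the setting in the abstract concrete, here is a minimal sketch of ERM trained on noisy labels when a spurious feature is available. It is not the authors' experimental setup: the two-feature data generator, the agreement probability `p_agree`, the noise scale 0.5, and every function name are assumptions chosen only for illustration.

```python
import numpy as np

def make_data(n, p_agree, rng):
    """Toy data with one invariant and one spurious feature (hypothetical setup)."""
    y = rng.choice([-1.0, 1.0], size=n)
    # The spurious sign agrees with the true label with probability p_agree;
    # samples where it disagrees form the minority group.
    agree = rng.random(n) < p_agree
    s = np.where(agree, y, -y)
    x_inv = y + 0.5 * rng.standard_normal(n)
    x_spu = s + 0.5 * rng.standard_normal(n)
    return np.stack([x_inv, x_spu], axis=1), y, agree

def flip_labels(y, eta, rng):
    """Flip each label independently with probability eta (label noise)."""
    flip = rng.random(y.shape[0]) < eta
    return np.where(flip, -y, y), flip

def logistic_loss(w, X, y):
    """Mean of log(1 + exp(-y * <w, x>)), computed stably."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def train_erm(X, y, lr=0.1, steps=500):
    """Plain gradient descent on the mean logistic loss (ERM)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # sigmoid(-m) written with tanh to avoid overflow.
        sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(X * (y * sig_neg)[:, None]).mean(axis=0)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X, y, agree = make_data(2000, p_agree=0.9, rng=rng)
y_noisy, flip = flip_labels(y, eta=0.25, rng=rng)
w = train_erm(X, y_noisy)
print("fraction of flipped labels:", flip.mean())
print("weights (invariant, spurious):", w)
```

Comparing the learned weights across several values of `eta` gives a rough feel for how noise changes the reliance on each feature in this toy setup; it is not a substitute for the paper's overparameterized analysis.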

Paper Structure (72 sections, 7 theorems, 49 equations, 6 figures, 5 tables)

Key Result

Theorem 4.2

Under the separability assumption (assump:separable), if $\lVert {\mathbf{w}}^{(\textnormal{inv})}_{\textnormal{inv}} \rVert^2 - \lVert {\mathbf{w}}^{(\textnormal{spu})}_{\textnormal{spu}} \rVert^2 \geq n(1-\gamma)(1-2\eta)C$, then with high probability, $\lVert {\mathbf{w}}^{(\textnormal{inv})} \rVert \geq \dots$
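
The condition compares the gap between the two squared norms against a threshold that depends on the noise level $\eta$. As a worked reading of that threshold only (assuming $\gamma<1$, $C>0$, and $0\le\eta\le 1/2$, none of which this excerpt states explicitly, and writing $T(\eta)$ as a shorthand that does not appear in the paper):

$$
T(\eta) = n(1-\gamma)(1-2\eta)C, \qquad T(0) = n(1-\gamma)C, \qquad T(1/4) = \tfrac{1}{2}\,n(1-\gamma)C, \qquad T(1/2) = 0 .
$$

Under these assumptions the threshold shrinks linearly as $\eta$ grows, so a fixed gap between the two squared norms satisfies the condition more easily at higher noise levels.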

Figures (6)

  • Figure 1: Simulation on synthetic data trained with overparameterized logistic regression. Dotted lines show the means and shaded regions the standard errors over 5 independent runs. The noise panel (fig:noise) shows that the minority-group (worst-group) error increases much more than the majority-group error when label noise is injected. The sample-size panel (fig:ndata) indicates that gathering more data effectively reduces all test errors despite the presence of label noise. The decision-boundary panel (fig:decision-boundary) visualizes the learned decision boundaries for the same training set under different $\eta$, where the markers "$\circ$" and "$\times$" denote the majority and minority groups, respectively. All data points are colored by their true labels. As more label noise is added, the classifier becomes more skewed towards using the spurious features. The noise-norm panel (fig:noise-norm) shows that as more noise is present, ${\mathbf{w}}^{(\textnormal{spu})}$ indeed tends to have a smaller norm than ${\mathbf{w}}^{(\textnormal{inv})}$, even though its norm is larger when $\eta=0$.
  • Figure 2: IRMv1 gradient coefficient function $\alpha$ w.r.t. regularization strength $\lambda$. The shaded area marks the region where $\alpha(\phi)<0$. As $\lambda$ increases, the valley below 0 deepens, providing stronger resistance.
  • Figure 3: Simulation on the CMNIST dataset. As the noise level $\eta$ increases, both IRM and V-REx exhibit better generalization and noise robustness than approaches with ERM objectives.
  • Figure 4: IRM gradient coefficient w.r.t. regularization strength $\lambda$
  • Figure 5: Decision Boundaries for $\sigma^2_{\textnormal{inv}}=\sigma^2_{\textnormal{spu}}=0.25, {d_{\textnormal{inv}}}={d_{\textnormal{spu}}}=1, \lambda=10^{-4}$
  • ...and 1 more figure
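
The captions above refer to IRM (IRMv1) and V-REx by name. As a rough sketch of what those objectives look like for binary classification, the snippet below computes the IRMv1 gradient penalty and the V-REx risk-variance penalty from per-environment logits. It follows the commonly stated forms of the two objectives rather than the authors' code; the function names, the use of sums over environments, and the parameters `lam` and `beta` are assumptions, and normalization conventions vary between implementations. It does not compute the coefficient function $\alpha$ shown in the figures.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def bce_risk(logits, y):
    """Mean binary cross-entropy for labels y in {0, 1}."""
    return np.mean(np.logaddexp(0.0, logits) - y * logits)

def irmv1_penalty(logits, y):
    """Squared derivative of the risk w.r.t. a scalar multiplier w on the
    logits, evaluated at w = 1 (the dummy classifier used by IRMv1)."""
    grad = np.mean((sigmoid(logits) - y) * logits)
    return grad ** 2

def irmv1_objective(env_logits, env_labels, lam):
    """Sum over environments of risk + lam * gradient penalty."""
    total = 0.0
    for z, y in zip(env_logits, env_labels):
        total += bce_risk(z, y) + lam * irmv1_penalty(z, y)
    return total

def vrex_objective(env_logits, env_labels, beta):
    """Sum of environment risks + beta * variance of those risks."""
    risks = np.array([bce_risk(z, y) for z, y in zip(env_logits, env_labels)])
    return risks.sum() + beta * risks.var()

# Two hypothetical environments with random logits and labels.
rng = np.random.default_rng(0)
env_logits = [rng.standard_normal(100), 2.0 * rng.standard_normal(100)]
env_labels = [(rng.random(100) < 0.5).astype(float) for _ in range(2)]
print("IRMv1 objective:", irmv1_objective(env_logits, env_labels, lam=10.0))
print("V-REx objective:", vrex_objective(env_logits, env_labels, beta=10.0))
```

Setting `lam` or `beta` to 0 reduces both objectives to the summed per-environment risk, which is the ERM objective up to how environments are weighted.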

Theorems & Definitions (12)

  • Theorem 4.2
  • Lemma C.1
  • proof
  • Theorem C.2
  • proof
  • Remark C.3
  • Lemma C.4
  • Theorem C.5: soudry2018implicit
  • Proposition C.6
  • proof
  • ...and 2 more