Towards a Better Evaluation of Out-of-Domain Generalization

Duhun Hwang, Suhyun Kang, Moonjung Eo, Jimyeong Kim, Wonjong Rhee

TL;DR

This paper challenges the long-standing reliance on the average measure for evaluating domain generalization (DG) and introduces the worst+gap measure, which jointly accounts for the worst-case performance and the spread across environments. The authors ground the new measure in two theoretical results, one under a uniform-risk assumption and another with a decreasing risk range, that bound the ideal DG performance using the worst and gap components. They validate the approach on five datasets of increasing realism (SR-CMNIST, C-Cats&Dogs, L-CIFAR10, and corrupted versions of the real-world PACS and VLCS datasets), each modified from an existing dataset so that the true DG performance can be accessed. Across these datasets, worst+gap consistently correlates more strongly with the ideal measure than the average does and better supports selecting the truly best DG algorithm. The paper also discusses practical implications for algorithm selection, asymptotic behavior as the number of environments grows, and the limitations of ERM. Overall, worst+gap offers a robust and practically useful alternative for evaluating and advancing out-of-domain generalization.

Abstract

The objective of Domain Generalization (DG) is to devise algorithms and models capable of achieving high performance on previously unseen test distributions. In pursuit of this objective, the average measure has been the prevalent measure for evaluating models and comparing algorithms in existing DG studies. Despite its significance, the average measure has not been comprehensively explored, and its suitability for approximating the true domain generalization performance has been questionable. In this study, we carefully investigate the limitations inherent in the average measure and propose the worst+gap measure as a robust alternative. We establish the theoretical grounds of the proposed measure by deriving two theorems from two different assumptions. We conduct extensive experiments to compare the proposed worst+gap measure with the conventional average measure. Because studying measures requires access to the true DG performance, we modify five existing datasets to create the SR-CMNIST, C-Cats&Dogs, L-CIFAR10, PACS-corrupted, and VLCS-corrupted datasets. The experimental results reveal that the average measure approximates the true DG performance poorly and confirm the robustness of the theoretically supported worst+gap measure.
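
To make the contrast between the two measures concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each algorithm is summarized by a list of per-environment error rates (lower is better), that "worst" is the largest of these errors, and that "gap" is the difference between the largest and smallest error; the exact definitions used in the paper may differ. The function name `dg_measures` and the two example algorithms are hypothetical.

```python
from statistics import mean


def dg_measures(env_errors):
    """Summarize per-environment error rates (lower is better).

    Assumed definitions, for illustration only:
      worst = largest error over the evaluated environments
      gap   = worst error minus best (smallest) error
    """
    if not env_errors:
        raise ValueError("need at least one environment")
    worst = max(env_errors)
    best = min(env_errors)
    gap = worst - best
    return {
        "average": mean(env_errors),
        "worst": worst,
        "gap": gap,
        "worst+gap": worst + gap,
    }


# Hypothetical error rates on three evaluation environments.
algo_a = [0.10, 0.12, 0.50]  # strong on two environments, fails on one
algo_b = [0.25, 0.26, 0.27]  # uniformly moderate

for name, errors in (("A", algo_a), ("B", algo_b)):
    m = dg_measures(errors)
    print(name, {key: round(value, 3) for key, value in m.items()})
```

In this toy example the average prefers algorithm A, while the assumed worst+gap measure prefers algorithm B because A fails badly on one environment. This kind of disagreement is what the comparison between the two measures is meant to expose.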

Paper Structure (29 sections, 4 theorems, 39 equations, 11 figures, 6 tables)

Key Result

Theorem 3.1

(Chebyshev's inequality [markov1884certain, bienayme1853considerations]) Let $X$ be a random variable with finite expected value $\mu$ and finite non-zero variance $\sigma^2$. Then for any real number $k>0$, $\Pr(|X-\mu| \geq k\sigma) \leq \frac{1}{k^2}$.
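
As a quick numerical illustration of Chebyshev's inequality in its standard form, $\Pr(|X-\mu| \geq k\sigma) \leq 1/k^2$, the sketch below compares the observed tail fraction of a sample with the bound. The exponential distribution, the sample size, and the helper name `tail_fraction` are arbitrary choices made for this example and are not taken from the paper.

```python
import random


def tail_fraction(samples, k):
    """Fraction of samples at least k standard deviations from the mean.

    The mean and the (population) standard deviation are computed from
    the samples themselves.
    """
    n = len(samples)
    mu = sum(samples) / n
    sigma = (sum((x - mu) ** 2 for x in samples) / n) ** 0.5
    return sum(1 for x in samples if abs(x - mu) >= k * sigma) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.expovariate(1.0) for _ in range(10_000)]

for k in (1.5, 2.0, 3.0):
    frac = tail_fraction(samples, k)
    bound = 1.0 / k**2
    print(f"k={k}: observed tail fraction {frac:.4f} <= bound {bound:.4f}")
```

Because the empirical distribution of a finite sample is itself a distribution with finite mean and variance, the inequality applies to it directly, so the observed fraction cannot exceed $1/k^2$ for any $k>0$. The bound is typically loose, which is the price of making no assumption about the shape of the distribution.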

Figures (11)

  • Figure 1: Comparison between ERM and the best performing algorithm ($A^{*}_{\text{IDEAL}}$) for different Ratio configurations of SR-CMNIST dataset. For each plot, (a) true error rate, (b) worst+gap measure, or (c) average measure is used as the measure. Compared to the true error rate shown in (a), the average measure shown in (c) distorts the assessment. The results for Scale$=4$ are shown. The results for the other Scale values can be found in Section \ref{sec:fulltable_erm}.
  • Figure 2: Invariant and spurious features of SR-CMNIST. In SR-CMNIST, color serves as the spurious feature.
  • Figure 3: Invariant and spurious features of the C-Cats&Dogs and L-CIFAR10 datasets. In the C-Cats&Dogs dataset, color serves as the spurious feature. In the L-CIFAR10 dataset, a colored line serves as the spurious feature.
  • Figure 4: Our PACS-corrupted dataset consists of 64 environments: the 4 original base environments of PACS plus the 60 created by applying 15 different corruptions to each of them. In the figures, (a) illustrates the 4 base environments of PACS and (b) shows the 15 corruptions [hendrycks2019benchmarking] used to increase the number of environments. Similarly, the VLCS-corrupted dataset is constructed by applying the same 15 corruptions to the VLCS dataset.
  • Figure 5: Scatter plots of the raw measure values. From the SR-CMNIST experiments, we gather the raw evaluation values for the ideal measure, worst+gap measure, and average measure. The scatter plots are produced from these three sets of values: the red dots show 'ideal measure vs. worst+gap measure' and the blue dots show 'ideal measure vs. average measure'.
  • ...and 6 more figures

Theorems & Definitions (7)

  • Theorem 3.1
  • Lemma 3.2
  • proof
  • Theorem 3.3
  • proof
  • Theorem 3.4
  • proof