Rethinking the Evaluation Protocol of Domain Generalization

Han Yu; Xingxuan Zhang; Renzhe Xu; Jiashuo Liu; Yue He; Peng Cui

TL;DR

This paper interrogates the reliability of current domain generalization evaluation by exposing test-data leakage risks from supervised pretraining and oracle model selection. It proposes a revised protocol featuring self-supervised or scratch pretraining and evaluation across multiple test domains, accompanied by new leaderboards to enable fairer comparisons. The results show that traditional rankings can shift under the revised protocol, with self-supervised leaderboards aligning more closely with scratch-trained baselines and SWAD maintaining top performance. By providing a framework for fair OOD evaluation and highlighting the need for diverse test domains and leakage-minimizing pretraining, the work has practical implications for robust DG research and benchmarks.

Abstract

Domain generalization aims to address Out-of-Distribution (OOD) generalization by leveraging common knowledge learned from multiple training domains to generalize to unseen test domains. Accurately evaluating OOD generalization ability requires that test data information be unavailable. However, the current domain generalization protocol may still leak test data information. This paper examines the risk of such leakage in two aspects of the current evaluation protocol: supervised pretraining on ImageNet and oracle model selection. We propose two modifications to the current protocol: employing self-supervised pretraining or training from scratch instead of supervised pretraining, and using multiple test domains. These changes would result in a more precise evaluation of OOD generalization ability. We also rerun the algorithms under the modified protocol and introduce new leaderboards to encourage future research in domain generalization with fairer comparisons.
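
To make the two protocol concerns concrete, the following is a minimal sketch of one possible reading of "multiple test domains" combined with model selection that never looks at test accuracy. It is not the authors' implementation: the function names, the checkpoint dictionary layout (`val_acc`, `test_acc`), and the domain labels are all hypothetical, and any numbers fed to it would be illustrative only.

```python
from itertools import combinations


def multi_test_domain_splits(domains, num_test):
    """Enumerate (train_domains, test_domains) pairs that hold out
    `num_test` domains at once (hypothetical helper, not from the paper)."""
    splits = []
    for test in combinations(domains, num_test):
        train = tuple(d for d in domains if d not in test)
        splits.append((train, test))
    return splits


def select_checkpoint(checkpoints):
    """Choose the checkpoint with the best training-domain validation
    accuracy. Test accuracy is deliberately ignored here, in contrast to
    oracle selection, which would pick by test-domain performance."""
    return max(checkpoints, key=lambda c: c["val_acc"])


def evaluate_split(checkpoints):
    """Report the mean accuracy over all held-out test domains for the
    checkpoint chosen without access to test data."""
    chosen = select_checkpoint(checkpoints)
    test_accs = chosen["test_acc"]  # assumed layout: {domain_name: accuracy}
    return sum(test_accs.values()) / len(test_accs)
```

Under these assumptions, a benchmark with four domains and two held-out domains per split yields six train/test splits, and each split reports a single score averaged over its test domains, so a method cannot be tuned to one particular test domain.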

Paper Structure (30 sections, 1 equation, 1 figure, 21 tables)

Introduction
New leaderboards
Rethinking the Evaluation Protocol
Pretraining
Oracle Model Selection
New Leaderboards
Experimental settings
Protocol modifications
Algorithms
Other details
Results
Performance rankings of some algorithms show great variations after applying the modified protocol.
Rankings in self-supervised pretraining leaderboards are more consistent with leaderboards of training from scratch than supervised pretraining leaderboards.
SWAD consistently ranks 1st in every leaderboard.
Discussion
...and 15 more sections

Figures (1)

Figure 1: Test domain accuracy with the growing number of frozen layers when using ImageNet pretrained weights. Test domains include PASCAL from VLCS, Art and Real from OfficeHome, and Real from DomainNet.

Theorems & Definitions (1)

Definition 1: Domain Generalization
