
ERM++: An Improved Baseline for Domain Generalization

Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Kate Saenko, Bryan A. Plummer

TL;DR

ERM++ revisits the empirical risk minimization (ERM) baseline for multi-source domain generalization (DG), arguing that careful choices in the training procedure let it surpass more complex DG methods. The method groups its improvements into three parts: Training Data Utilization (Auto-LR and Full Data retraining), Initialization (strong pre-trained weights), and Regularization (weight-space regularization such as model parameter averaging (MPA), warm start (WS), unfrozen BatchNorm (UBN), and Attention Tuning). Across five DomainBed datasets and both ResNet-50 and ViT-B/16 backbones, ERM++ yields sizable gains over ERM baselines (over 5% with ResNet-50 and over 15% with ViT-B/16) and matches or surpasses current state-of-the-art methods. The approach is easy to implement and to integrate into existing pipelines, including DomainBed, and is intended as a strong, simple baseline for future DG research; code is publicly available.
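To make the weight-space idea behind MPA concrete, here is a minimal sketch of a uniform running average of parameter snapshots taken along the training trajectory. It assumes parameters are a plain dict of floats (a stand-in for real tensors), and `ParameterAverager` and its `start_step` knob are hypothetical names for illustration, not the ERM++ code's API.

```python
from typing import Dict

# Toy stand-in for a model's named parameters (real models would use tensors).
Params = Dict[str, float]


class ParameterAverager:
    """Uniform running average of parameter snapshots (illustrative MPA sketch).

    `start_step` is a hypothetical knob: snapshots before it are skipped so that
    early, rapidly changing weights do not dominate the average.
    """

    def __init__(self, start_step: int = 0) -> None:
        self.start_step = start_step
        self.count = 0
        self.avg: Params = {}

    def update(self, step: int, params: Params) -> None:
        if step < self.start_step:
            return  # ignore snapshots taken before averaging starts
        self.count += 1
        if self.count == 1:
            self.avg = dict(params)  # first snapshot initializes the average
            return
        # Incremental mean: avg <- avg + (x - avg) / n
        for name, value in params.items():
            self.avg[name] += (value - self.avg[name]) / self.count

    def averaged(self) -> Params:
        return dict(self.avg)


averager = ParameterAverager(start_step=1)
trajectory = [{"w": 10.0}, {"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
for step, snapshot in enumerate(trajectory):
    averager.update(step, snapshot)
print(averager.averaged())  # {'w': 2.0}
```

The incremental mean avoids storing every snapshot. For real PyTorch models, `torch.optim.swa_utils.AveragedModel` provides a comparable equal-weight average over tensors.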

Abstract

Domain Generalization (DG) aims to develop classifiers that generalize to new, unseen data distributions, a critical capability when collecting new domain-specific data is impractical. A common DG baseline minimizes the empirical risk on the source domains. Recent studies have shown that this approach, known as Empirical Risk Minimization (ERM), can outperform most of the more complex DG methods when properly tuned. However, these studies have focused primarily on a narrow set of hyperparameters, neglecting other factors that can enhance robustness and prevent overfitting and catastrophic forgetting, properties that are critical for strong DG performance. In our investigation of training data utilization (i.e., training duration and how validation splits are set), initialization, and additional regularizers, we find that tuning these previously overlooked factors significantly improves model generalization across diverse datasets without adding much complexity. We call this improved yet simple baseline ERM++. Despite its ease of implementation, ERM++ improves DG performance by over 5% compared to prior ERM baselines on a standard benchmark of 5 datasets with a ResNet-50, and by over 15% with a ViT-B/16. It also outperforms all state-of-the-art methods on DomainBed datasets with both architectures. Importantly, ERM++ is easy to integrate into existing frameworks like DomainBed, making it a practical and powerful tool for researchers and practitioners. Overall, ERM++ challenges the need for more complex DG methods by providing a stronger, more reliable baseline that maintains simplicity and ease of use. Code is available at https://github.com/piotr-teterwak/erm_plusplus
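Training data utilization concerns how long to train and how the validation split is used. One simple way to realize full-data retraining, assuming a held-out source split is used only to choose the number of training steps before retraining on all source data for that length, is sketched below. `train_fn`, `eval_fn`, and `candidate_steps` are hypothetical placeholders, and the sketch does not reproduce the exact Auto-LR procedure.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical callables standing in for a real training pipeline:
#   train_fn(data, num_steps) -> model
#   eval_fn(model, data) -> accuracy in [0, 1]
TrainFn = Callable[[Sequence, int], object]
EvalFn = Callable[[object, Sequence], float]


def select_steps_then_retrain(
    train_split: Sequence,
    val_split: Sequence,
    candidate_steps: List[int],
    train_fn: TrainFn,
    eval_fn: EvalFn,
) -> Tuple[int, object]:
    """Choose the training length on a held-out source split, then retrain on all source data."""
    # Stage 1: the validation split is used only to decide how long to train.
    scores = {}
    for steps in candidate_steps:
        model = train_fn(train_split, steps)
        scores[steps] = eval_fn(model, val_split)
    best_steps = max(candidate_steps, key=lambda s: scores[s])

    # Stage 2: fold the validation data back in and retrain for the chosen length,
    # so no labeled source data is left unused in the final model.
    full_data = list(train_split) + list(val_split)
    final_model = train_fn(full_data, best_steps)
    return best_steps, final_model


def toy_train(data, num_steps):
    # Toy "model" that just records what it was trained on.
    return {"n_examples": len(data), "steps": num_steps}


def toy_eval(model, data):
    # Pretend validation accuracy peaks at 300 steps, then overfitting sets in.
    return 1.0 - abs(model["steps"] - 300) / 1000


steps, model = select_steps_then_retrain(
    train_split=list(range(80)),
    val_split=list(range(20)),
    candidate_steps=[100, 300, 500],
    train_fn=toy_train,
    eval_fn=toy_eval,
)
print(steps, model)  # 300 {'n_examples': 100, 'steps': 300}
```

Retraining once per candidate keeps the sketch short. In practice, one would evaluate checkpoints of a single run on the validation split instead.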

Paper Structure (25 sections, 17 figures, 13 tables)

Figures (17)

  • Figure 1: Our goal is to provide a simple yet effective baseline that enhances robustness without adding complexity. Pre-training data (green) is often more similar to target data (purple) than the source data (blue) is. Preventing catastrophic forgetting and overfitting to the source is thus critical in DG. We introduce ERM++, which addresses this through three key principles: Training Data Utilization (Sec. \ref{subsec:data_util}), Initialization (Sec. \ref{subsec:params}), and Regularization (Sec. \ref{subsec:weightspace_reg}).
  • Figure 2: Prior baselines like ERM [gulrajani2020search] tune a small set of hyperparameters. We extend tuning to other factors that control catastrophic forgetting and overfitting to the source.
  • Figure 3: OfficeHome: Samples from the OfficeHome [venkateswara2017deep] dataset, from each domain and selected classes. The dataset focuses on household objects. The domain shifts are mostly in low-level style, and there is little spatial bias.
  • Figure 4: Examples of attention tuning: attention visualizations of a DINOv2 model. We average over all attention heads in the final attention block. In the pre-trained model, attention is scattered. In both the attention-tuned and the fully fine-tuned model, attention is more focused than in the pre-trained model. However, on some samples (representative samples pictured here), full fine-tuning misses discriminative but occluded animal features. In the top-right images, the attention-tuned model picks up the dog. In the bottom-right images, the attention-tuned model picks up a tail in the lower-left corner.
  • Figure 5: DomainNet: Samples from the DomainNet [peng2019moment] dataset. While the real domain is quite similar to what one might expect in ImageNet, the distribution shifts in the other domains are substantial. Quickdraw and Infograph are particularly challenging, so the 1-3% gains of ERM++ on these domains are meaningful (Table \ref{tab:dn}). While most domains primarily contain shifts in low-level statistics (for example, real to painting), Infograph also has many non-centered objects.
  • ...and 12 more figures
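Figure 4 contrasts attention tuning with full fine-tuning. Below is a minimal sketch of the parameter selection behind attention tuning. It assumes that attention tuning means updating only the self-attention weights of each transformer block, plus the classification head, while freezing everything else. The parameter names follow a common ViT naming scheme (`blocks.<i>.attn.*`, `head.*`) and are assumptions, as is the choice to train the head. `attention_tuning_mask` is a hypothetical helper.

```python
from typing import Dict, List


def attention_tuning_mask(param_names: List[str], train_head: bool = True) -> Dict[str, bool]:
    """Map each parameter name to a trainable flag (illustrative sketch).

    Only attention-block parameters (names containing ".attn.") and, optionally,
    the classifier head ("head.*") are marked trainable; everything else is frozen.
    """
    mask = {}
    for name in param_names:
        is_attention = ".attn." in name
        is_head = name.startswith("head.")
        mask[name] = is_attention or (train_head and is_head)
    return mask


# Hypothetical parameter names in a common ViT layout; real checkpoints may differ.
names = [
    "patch_embed.proj.weight",
    "blocks.0.norm1.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn.proj.weight",
    "blocks.0.mlp.fc1.weight",
    "head.weight",
]
mask = attention_tuning_mask(names)
print([n for n, trainable in mask.items() if trainable])
# ['blocks.0.attn.qkv.weight', 'blocks.0.attn.proj.weight', 'head.weight']
```

With a PyTorch model, the same mask could be applied by setting `param.requires_grad = mask[name]` for each `name, param` in `model.named_parameters()`. This freezes the MLP and embedding weights, so the pre-trained features are less exposed to forgetting.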