Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations

Ha Min Son, Zhe Zhao, Shahbaz Rezaei, Xin Liu

TL;DR

This paper addresses the difficulty of evaluating domain generalization for foundation models trained on web-scale data, where true OOD robustness is hard to assess due to potential data leakage. It introduces a more challenging in-the-wild evaluation across 33 diverse datasets and a novel unlearning probe to simulate unseen domains. The proposed CLIP-DCA method disentangles classification from enhanced domain-aware representations by adding an image domain head and using synthetic diffusion domains plus MLLM-derived signals, while enforcing domain-invariant classification at the final layer through disentanglement. Empirically, CLIP-DCA yields stronger OOD robustness than standard fine-tuning and many baselines, especially on more OOD targets. The study demonstrates the importance of balancing domain awareness with classifier invariance for robust generalization in large pretrained models.

Abstract

Evaluating domain generalization (DG) for foundation models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP 'forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.

Paper Structure

This paper contains 31 sections, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Improvement over zero-shot after fine-tuning on ImageNet (in %). Each dot represents a target dataset. OOD scores are quantified relative to ImageNet (source dataset), illustrating the challenge of DG in-the-wild.
  • Figure 1: Unlearning effectiveness. ZS: Original zero-shot performance. FT: Baseline fine-tuning on the GCC retention set. Unlearn: Full unlearning combining retention on GCC with adversarial unlearning on DomainNet.
  • Figure 2: CLIP-DCA applies different sets of losses to source data images and diffusion images. For source images, accurate classification is encouraged through the classification loss between class head and text encoder ($C_1$). Invariance is encouraged through the disentanglement between domain and class heads ($C_2$). With diffusion images, domain invariance is encouraged through the disentanglement between the domain and class heads ($C_3$), and disentanglement between class head and text encoder ($C_4$). Domain awareness is encouraged through the agreement between the domain head and the text encoder ($C_5$), and the agreement between the text encoder and the MLLM hidden states ($C_6$). During inference, only the class head and text projector are used for classification.
  • Figure 2: Accuracy on ImageNet variants
  • Figure 3: Standard CLIP inference pipeline using a dot product between image and text embeddings for classification.
  • ...and 8 more figures
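
The standard CLIP inference pipeline named in the Figure 3 caption (a dot product between image and text embeddings) can be illustrated with a minimal sketch. This is not the paper's code: the function names are hypothetical, and the toy vectors stand in for the outputs of CLIP's image and text encoders. Embeddings are L2-normalized here so that the dot product equals cosine similarity; CLIP implementations typically also multiply the scores by a learned temperature before a softmax, which does not change the predicted class.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(image_embedding, text_embeddings):
    """Return (index of best-matching class prompt, per-class similarity scores).

    image_embedding: one vector (assumed to come from the image encoder).
    text_embeddings: one vector per class prompt (assumed to come from the text encoder).
    """
    image = l2_normalize(image_embedding)
    scores = [dot(image, l2_normalize(text)) for text in text_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: three class prompts in a 3-dimensional embedding space.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [
    [1.0, 0.0, 0.0],  # hypothetical prompt for class 0
    [0.0, 1.0, 0.0],  # hypothetical prompt for class 1
    [0.0, 0.0, 1.0],  # hypothetical prompt for class 2
]
predicted, scores = zero_shot_classify(image_embedding, text_embeddings)
```

In this toy setup the image vector lies closest to the first prompt, so `predicted` is `0`. According to the Figure 2 caption, CLIP-DCA keeps this inference shape, using only the class head and text projector at test time.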