Rethinking Domain Adaptation and Generalization in the Era of CLIP

Ruoyu Feng, Tao Yu, Xin Jin, Xiaoyuan Yu, Lei Xiao, Zhibo Chen

TL;DR

This work reevaluates domain adaptation in the era of CLIP by showing that a simple domain prior can boost zero-shot recognition and that CLIP's broad pre-training reduces reliance on labeled source data. It introduces a four-part framework comprising domain-prior-guided zero-shot inference, pseudo-labeling-based self-training, TaskRes residual tuning, and a label-free multi-source domain generalization scheme, with a mathematical formulation for each component. Empirically, CLIP-based methods with domain priors and self-training outperform prior unsupervised domain adaptation (DA) approaches on DomainNet and OfficeHome, and ablations confirm the effectiveness of residual-based tuning and of decomposing the task residual into domain-shared and domain-specific parts. The findings suggest rethinking domain adaptation benchmarks for vision-language models and offer practical, scalable strategies for task generalization across unlabeled domains.

Abstract

Recent studies on domain adaptation have placed significant emphasis on learning shared knowledge that transfers from a source domain to a target domain. Recently, the large pre-trained vision-language model CLIP has shown strong zero-shot recognition ability, and parameter-efficient tuning can further improve its performance on specific tasks. This work demonstrates that a simple domain prior boosts CLIP's zero-shot recognition in a specific domain. In addition, adapting CLIP relies less on source-domain data because of its diverse pre-training dataset. Furthermore, we create a benchmark for zero-shot adaptation and pseudo-labeling-based self-training with CLIP. Finally, we propose improving CLIP's task generalization ability using multiple unlabeled domains, which is a more practical and distinctive scenario. We believe our findings motivate a rethinking of domain adaptation benchmarks and of the role of related algorithms in the era of CLIP.
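
The domain prior and pseudo-labeling ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template `a {domain} of a {class}`, the function names, and the toy feature vectors are all assumptions made for illustration, and in practice the features would come from CLIP's image and text encoders.

```python
import math


def build_prompts(class_names, domain=None):
    """Return one text prompt per class.

    Assumed template: with a domain prior, the generic word "photo"
    is replaced by the domain name (e.g. "sketch").
    """
    style = domain if domain else "photo"
    return [f"a {style} of a {name}" for name in class_names]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_feature, text_features):
    """Index of the class whose text feature is most similar to the image."""
    sims = [cosine(image_feature, t) for t in text_features]
    return max(range(len(sims)), key=sims.__getitem__)


def pseudo_labels(image_features, text_features):
    """Zero-shot predictions used as training targets for unlabeled images."""
    return [zero_shot_predict(f, text_features) for f in image_features]
```

For example, `build_prompts(["dog", "cat"], domain="sketch")` yields prompts that mention the target domain, and `pseudo_labels` turns the zero-shot predictions on unlabeled target images into targets for self-training.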

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Motivation of this paper. Since CLIP is already equipped with strong zero-shot ability, we propose to carefully fine-tune the decision boundary of the target domain by learning a task residual with pseudo-labeling-based self-training.
  • Figure 2: Training and inference pipelines for learning task information from multiple unlabeled domains for domain generalization. a) During training, the task residual is disentangled into "shared" and "specific" parts. b) During inference, we use only the shared residual, which contains common task-adaptive knowledge.
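
The shared/specific split described in the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the paper's exact formulation: it assumes the TaskRes-style form in which classifier weights are the frozen text features plus a scaled residual, and it assumes the shared and domain-specific residuals are simply summed during training. The function names and the scaling factor `alpha=0.5` are hypothetical.

```python
def add_residual(text_features, residual, alpha=0.5):
    """Classifier weights: frozen text features plus a scaled residual."""
    return [
        [t + alpha * r for t, r in zip(t_row, r_row)]
        for t_row, r_row in zip(text_features, residual)
    ]


def training_weights(text_features, shared, specific_by_domain, domain, alpha=0.5):
    """Training: shared and domain-specific residuals are combined.

    Summation is an assumed way of combining them.
    """
    specific = specific_by_domain[domain]
    combined = [
        [s + d for s, d in zip(s_row, d_row)]
        for s_row, d_row in zip(shared, specific)
    ]
    return add_residual(text_features, combined, alpha)


def inference_weights(text_features, shared, alpha=0.5):
    """Inference: only the shared residual is used, so no domain label is needed."""
    return add_residual(text_features, shared, alpha)
```

Dropping the domain-specific residual at inference means the classifier does not need to know which domain a test image comes from, which matches the domain generalization setting in the caption.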