Rethinking Domain Adaptation and Generalization in the Era of CLIP
Ruoyu Feng, Tao Yu, Xin Jin, Xiaoyuan Yu, Lei Xiao, Zhibo Chen
TL;DR
This work reevaluates domain adaptation in the era of CLIP by showing that simple domain priors can boost zero-shot recognition and that CLIP's broad pre-training reduces reliance on labeled source data. It introduces a four-part framework comprising domain-prior guided zero-shot inference, pseudo-labeling self-training, TaskRes residual tuning, and a label-free multi-source domain generalization scheme, with mathematical formulations for each component. Empirically, CLIP-based methods with domain priors and self-training outperform prior unsupervised DA approaches on DomainNet and OfficeHome, and ablations confirm the effectiveness of residual-based tuning and of decomposing domain-shared versus domain-specific components. The findings suggest rethinking domain adaptation benchmarks for vision-language models and offer practical, scalable strategies for task generalization across unlabeled domains.
Abstract
Recent studies on domain adaptation have placed significant emphasis on learning shared knowledge that transfers from a source domain to a target domain. Meanwhile, the large vision-language pre-trained model CLIP has shown strong zero-shot recognition ability, and parameter-efficient tuning can further improve its performance on specific tasks. This work demonstrates that a simple domain prior boosts CLIP's zero-shot recognition in a specific domain. In addition, CLIP's adaptation relies less on source domain data because of its diverse pre-training dataset. Furthermore, we create a benchmark for zero-shot adaptation and pseudo-labeling based self-training with CLIP. Finally, we propose to improve the task generalization ability of CLIP from multiple unlabeled domains, which is a more practical and unique scenario. We believe our findings motivate a rethinking of domain adaptation benchmarks and the associated role of related algorithms in the era of CLIP.
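To make the two ideas of a domain prior and pseudo-labeling concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes the domain prior is injected by naming the domain in the text prompt, and that pseudo-labels are kept only when the zero-shot confidence exceeds a threshold. The prompt templates, the function names, the temperature of 0.01, and the threshold of 0.9 are all illustrative assumptions, and the feature vectors are toy stand-ins for the embeddings that CLIP's image and text encoders would produce.

```python
import math

def build_prompts(class_names, domain=None):
    """Build one text prompt per class (templates are assumptions).

    Without a domain prior a generic template is used; with one,
    the prompt names the target domain, e.g. "a sketch of a dog".
    """
    if domain is None:
        return [f"a photo of a {c}" for c in class_names]
    return [f"a {domain} of a {c}" for c in class_names]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_predict(image_feature, text_features, temperature=0.01):
    """Return (predicted class index, confidence) for one image.

    Scores are temperature-scaled cosine similarities between the
    image feature and each class prompt feature.
    """
    logits = [cosine(image_feature, t) / temperature for t in text_features]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def select_pseudo_labels(image_features, text_features, threshold=0.9):
    """Keep (sample index, label) pairs whose confidence meets the threshold.

    The threshold rule is an assumed selection criterion for illustration.
    """
    selected = []
    for idx, feat in enumerate(image_features):
        label, conf = zero_shot_predict(feat, text_features)
        if conf >= threshold:
            selected.append((idx, label))
    return selected

# Toy usage: the 2-D vectors below stand in for real CLIP embeddings.
prompts = build_prompts(["dog", "cat"], domain="sketch")
text_features = [[1.0, 0.0], [0.0, 1.0]]       # one vector per prompt
unlabeled = [[0.9, 0.1], [1.0, 1.0], [0.1, 0.9]]
pseudo = select_pseudo_labels(unlabeled, text_features)
```

In a real pipeline the text features would come from encoding `prompts` with CLIP's text encoder, and the selected pairs would then serve as training targets for self-training on the unlabeled target domain. The ambiguous second sample is equally similar to both prompts, so it falls below the threshold and is left out.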
