Vision Transformer-based Adversarial Domain Adaptation
Yahan Li, Yuan Wu
TL;DR
This paper tackles unsupervised domain adaptation under domain shift by introducing VT-ADA, which replaces the CNN backbone with a Vision Transformer (ViT) as the feature extractor in adversarial domain adaptation (ADA) frameworks. The authors implement ViT within both the DANN and CDAN paradigms and report that VT-ADA learns better domain-invariant representations on the Office-31, ImageCLEF, and Office-Home benchmarks, with VT-ADA(CDAN) performing particularly well. They provide empirical evidence of improved transferability, discriminability, and faster convergence, supporting the claim that ViT can serve as a plug-and-play component in ADA. The work highlights the practical potential of ViT to improve cross-domain generalization in vision tasks where target labels are unavailable, and the code is publicly released for reproducibility.
Abstract
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Most recent UDA methods resort to adversarial training to achieve state-of-the-art results, and the majority of existing UDA methods employ convolutional neural networks (CNNs) as feature extractors to learn domain-invariant features. The vision transformer (ViT) has attracted tremendous attention since its emergence and has been widely used in various computer vision tasks, such as image classification, object detection, and semantic segmentation, yet its potential in adversarial domain adaptation has not been investigated. In this paper, we fill this gap by employing ViT as the feature extractor in adversarial domain adaptation. Moreover, we empirically demonstrate that ViT can be a plug-and-play component in adversarial domain adaptation: directly replacing the CNN-based feature extractor in existing UDA methods with a ViT-based feature extractor readily yields performance improvements. The code is available at https://github.com/LluckyYH/VT-ADA.
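To make the adversarial game concrete, the following is a minimal scalar sketch, not the authors' implementation. It assumes a one-parameter feature extractor and a one-parameter domain discriminator, and it uses the sign flip of DANN's gradient reversal so that the discriminator descends the domain loss while the extractor ascends it. The names `ScalarExtractor`, `domain_loss`, and `adversarial_step` are hypothetical; in practice the extractor would be a CNN or ViT backbone exposing the same forward/backward role, which is what makes the swap plug-and-play.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ScalarExtractor:
    """Hypothetical stand-in for any backbone (CNN or ViT): f = w * x."""

    def __init__(self, w):
        self.w = w

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return self.w * x

    def backward(self, grad_f):
        return grad_f * self.x  # dL/dw given dL/df


def domain_loss(extractor, v, x, d):
    """Binary cross-entropy of the discriminator sigmoid(v * f) against domain label d."""
    p = sigmoid(v * extractor.forward(x))
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))


def adversarial_step(extractor, v, x, d, lr=0.1, lam=1.0):
    """One update on a single sample; d is an assumed binary domain label."""
    f = extractor.forward(x)
    p = sigmoid(v * f)
    grad_z = p - d        # dL/dz for cross-entropy on a sigmoid output
    grad_v = grad_z * f   # gradient for the discriminator weight
    grad_f = grad_z * v   # gradient flowing back into the features
    # Gradient reversal: the extractor receives the negated, scaled gradient.
    grad_w = extractor.backward(-lam * grad_f)
    extractor.w -= lr * grad_w  # extractor moves to confuse the discriminator
    return v - lr * grad_v      # discriminator moves to separate the domains


extractor = ScalarExtractor(w=1.0)
before = domain_loss(extractor, 0.5, 2.0, 1)
new_v = adversarial_step(extractor, 0.5, 2.0, 1)
print(before, extractor.w, new_v)
```

Because `adversarial_step` only calls `forward` and `backward` on the extractor, any object providing those two methods could be substituted without touching the adversarial logic; this mirrors, in toy form, the backbone replacement the abstract describes.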
