Vision Transformer-based Adversarial Domain Adaptation

Yahan Li, Yuan Wu

TL;DR

This paper tackles unsupervised domain adaptation under domain shift by introducing VT-ADA, which replaces the CNN backbone with a Vision Transformer (ViT) as the feature extractor in adversarial domain adaptation (ADA) frameworks. The authors implement ViT within both the DANN and CDAN paradigms and report that VT-ADA learns better domain-invariant representations on the Office-31, ImageCLEF, and Office-Home benchmarks, with VT-ADA(CDAN) performing particularly well. They provide empirical evidence of improved transferability, improved discriminability, and faster convergence, supporting the claim that ViT can serve as a plug-and-play component in ADA. The work highlights the practical potential of ViT for cross-domain generalization in vision tasks where target labels are unavailable, and it offers a public code release for reproducibility.

Abstract

Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Most recent UDA methods rely on adversarial training to achieve state-of-the-art results, and the majority of existing UDA methods employ convolutional neural networks (CNNs) as feature extractors to learn domain-invariant features. The Vision Transformer (ViT) has attracted tremendous attention since its emergence and has been widely used in computer vision tasks such as image classification, object detection, and semantic segmentation, yet its potential in adversarial domain adaptation has not been investigated. In this paper, we fill this gap by employing ViT as the feature extractor in adversarial domain adaptation. Moreover, we empirically demonstrate that ViT can be a plug-and-play component in adversarial domain adaptation: directly replacing the CNN-based feature extractor in existing UDA methods with a ViT-based feature extractor yields a performance improvement. The code is available at https://github.com/LluckyYH/VT-ADA.
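
The plug-and-play idea in the abstract can be illustrated with a minimal PyTorch sketch of a DANN-style setup in which the feature extractor is an interchangeable module. This is not the authors' implementation (see their repository for that). The names `GradReverse`, `AdversarialDA`, `TinyPatchTransformer`, and `dann_loss` are hypothetical, and the two tiny backbones are toy stand-ins for a pretrained CNN and a pretrained ViT. The gradient reversal layer is the standard DANN mechanism; the CDAN variant, which conditions the discriminator on classifier predictions, is not shown.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient flows to lambd, hence the trailing None.
        return -ctx.lambd * grad_output, None


class AdversarialDA(nn.Module):
    """Label classifier and domain discriminator on top of any backbone
    that maps an image batch to (batch, feat_dim) features."""

    def __init__(self, backbone, feat_dim, num_classes, hidden=32):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.discriminator = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, lambd=1.0):
        feats = self.backbone(x)
        class_logits = self.classifier(feats)
        # The discriminator sees the features through the reversal layer, so
        # minimizing its loss pushes the backbone toward domain-invariant features.
        domain_logits = self.discriminator(GradReverse.apply(feats, lambd))
        return class_logits, domain_logits


class TinyPatchTransformer(nn.Module):
    """Toy ViT-like stand-in: patch embedding, one encoder layer, mean pooling.
    It omits the class token and positional embeddings of a real ViT."""

    def __init__(self, dim=8, patch=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoderLayer(
            dim, nhead=2, dim_feedforward=16, batch_first=True
        )

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (batch, patches, dim)
        return self.encoder(tokens).mean(dim=1)  # (batch, dim)


def dann_loss(model, xs, ys, xt, lambd=1.0):
    """Source classification loss plus domain loss (source = 1, target = 0)."""
    cls_s, dom_s = model(xs, lambd)
    _, dom_t = model(xt, lambd)
    cls_loss = nn.functional.cross_entropy(cls_s, ys)
    dom_logits = torch.cat([dom_s, dom_t]).squeeze(1)
    dom_labels = torch.cat([torch.ones(len(xs)), torch.zeros(len(xt))])
    dom_loss = nn.functional.binary_cross_entropy_with_logits(dom_logits, dom_labels)
    return cls_loss + dom_loss


# Toy CNN stand-in producing 8-dimensional features.
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Swapping the backbone is the only change between the two models.
cnn_model = AdversarialDA(tiny_cnn, feat_dim=8, num_classes=4)
vit_model = AdversarialDA(TinyPatchTransformer(dim=8), feat_dim=8, num_classes=4)
```

Because the classifier and discriminator depend only on the feature dimension, the rest of the training loop stays unchanged when the backbone is swapped, which is the sense of "plug-and-play" used above.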

Paper Structure

This paper contains 14 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: t-SNE visualization results for (a) DANN, (b) CDAN, (c) CDAN+E, and (d) VT-ADA(CDAN). Red points represent data from domain A; blue points represent data from domain W.
  • Figure 2: Convergence