TransAdapter: Vision Transformer for Feature-Centric Unsupervised Domain Adaptation

A. Enes Doruk, Erhan Oztop, Hasan F. Ates

TL;DR

This paper tackles unsupervised domain adaptation for vision tasks with substantial domain gaps by introducing TransAdapter, a Swin Transformer–based framework augmented with three modules: Graph Domain Discriminator, Adaptive Double Attention, and Cross Feature Transform, plus pixel-wise transforms with pseudo labeling. The proposed components collectively enhance both local and long-range feature alignment and enable bidirectional feature transfer between source and target domains, improving generalization across domains. The authors demonstrate state-of-the-art results on Office-31, Office-Home, VisDA-2017, and DomainNet, with ablations confirming the contribution of each module. The work highlights TransAdapter's robustness and adaptability to diverse domain shifts, offering practical impact for real-world UDA tasks.

Abstract

Unsupervised Domain Adaptation (UDA) aims to utilize labeled data from a source domain to solve tasks in an unlabeled target domain, often hindered by significant domain gaps. Traditional CNN-based methods struggle to fully capture complex domain relationships, motivating the shift to vision transformers like the Swin Transformer, which excel in modeling both local and global dependencies. In this work, we propose a novel UDA approach leveraging the Swin Transformer with three key modules. A Graph Domain Discriminator enhances domain alignment by capturing inter-pixel correlations through graph convolutions and entropy-based attention differentiation. An Adaptive Double Attention module combines Windows and Shifted Windows attention with dynamic reweighting to align long-range and local features effectively. Finally, a Cross-Feature Transform modifies Swin Transformer blocks to improve generalization across domains. Extensive benchmarks confirm the state-of-the-art performance of our versatile method, which requires no task-specific alignment modules, establishing its adaptability to diverse applications.

Paper Structure

This paper contains 16 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The architecture of the proposed TransAdapter. MADA denotes the multi-head adaptive double attention module.
  • Figure 2: The architecture of the Adaptive Double Attention (ADA) module is depicted. Here, $KQV$ represents the standard window attention features, while $K_{\text{shift}}Q_{\text{shift}}V_{\text{shift}}$ corresponds to the shifted window attention features. Additionally, $H$ denotes the entropy-based attention matrix, which is constructed using features derived from the graph domain discriminator.
  • Figure 3: The architecture of the Graph Domain Discriminator. $K_s$ and $K_t$ denote the source and target key features of MADA, respectively.
  • Figure 4: The architecture of the Cross Feature Transform (CFT) module. $X_s$ and $X_t$ represent the source and target features, respectively.
  • Figure 5: t-SNE visualization of the Office-Home dataset, where red and blue points indicate the source and target domains, respectively. "-B" indicates that the backbone is the Base variant.
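
The Figure 2 caption mentions an entropy-based attention matrix $H$ built from graph domain discriminator features, and the abstract describes dynamic reweighting of window and shifted-window attention. The sketch below shows one way such a reweighting could work; it is not the authors' implementation. It makes two assumptions: that the discriminator outputs one domain probability per patch, and that a patch with high prediction entropy (the discriminator cannot tell source from target) should lean on the window-attention output. The blending rule and the names `entropy_weights` and `blend` are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli(p) domain prediction."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def entropy_weights(domain_probs):
    """Map per-patch domain probabilities to weights in [0, 1].

    Assumption: a patch the discriminator cannot classify (p near 0.5)
    is treated as domain-invariant and gets a weight near 1.
    """
    return [binary_entropy(p) / math.log(2.0) for p in domain_probs]


def blend(window_out, shifted_out, weights):
    """Hypothetical reweighting: per-patch convex mix of two attention outputs."""
    return [
        [w * a + (1.0 - w) * b for a, b in zip(row_w, row_s)]
        for row_w, row_s, w in zip(window_out, shifted_out, weights)
    ]


# Toy example: two patches with two feature channels each.
domain_probs = [0.5, 0.99]              # patch 0 is ambiguous, patch 1 is clearly one domain
window_out = [[1.0, 1.0], [1.0, 1.0]]   # stand-in for window attention features
shifted_out = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for shifted-window attention features

weights = entropy_weights(domain_probs)
mixed = blend(window_out, shifted_out, weights)
```

In this toy setup the ambiguous patch keeps almost all of the window-attention features, while the confidently classified patch is dominated by the shifted-window features. Which branch receives the higher weight is a design choice made here for illustration only.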