Feature Fusion Transferability Aware Transformer for Unsupervised Domain Adaptation

Xiaowei Yu, Zhe Huang, Zao Zhang

TL;DR

FFTAT is a novel Feature Fusion Transferability Aware Transformer that enhances ViT performance in UDA tasks. Its feature fusion technique fuses embeddings in the latent space so that each embedding incorporates information from all others, thereby improving generalization.

Abstract

Unsupervised domain adaptation (UDA) aims to leverage the knowledge learned from labeled source domains to improve performance on the unlabeled target domains. While Convolutional Neural Networks (CNNs) have been dominant in previous UDA methods, recent research has shown promise in applying Vision Transformers (ViTs) to this task. In this study, we propose a novel Feature Fusion Transferability Aware Transformer (FFTAT) to enhance ViT performance in UDA tasks. Our method introduces two key innovations: First, we introduce a patch discriminator to evaluate the transferability of patches, generating a transferability matrix. We integrate this matrix into self-attention, directing the model to focus on transferable patches. Second, we propose a feature fusion technique to fuse embeddings in the latent space, enabling each embedding to incorporate information from all others, thereby improving generalization. These two components work in synergy to enhance feature representation learning. Extensive experiments on widely used benchmarks demonstrate that our method significantly improves UDA performance, achieving state-of-the-art (SOTA) results.

Paper Structure

This paper contains 17 sections, 15 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of transferability graph-guided self-attention and feature fusion. The upper section (green) shows the transferability graph-guided self-attention and compares it with vanilla self-attention. The lower section (orange) illustrates the feature fusion mechanism, where the features of each sample are summed with the features of all other samples in the same batch. Each bar represents the features of a sample.
  • Figure 2: The overview of the FFTAT framework. In FFTAT, source and target images are divided into non-overlapping fixed-size patches which are linearly projected into the latent space and concatenated with positional information. A class token is prepended to the image tokens. The tokens are subsequently processed by a transformer encoder. The Feature Fusion Layer mixes the features as illustrated in Fig. 1. The patch discriminator assesses the transferability of each patch and generates a transferability graph, which is used to guide the attention mechanism in the transformer layers. The classifier head and self-clustering module operate on source domain images and target domain images, respectively. The Domain Discriminator predicts whether an image belongs to the source or target domain.
  • Figure 3: The learned transferability graphs (adjacency matrices) from randomly selected domain adaptation tasks. Edge weights increase with the intensity of red and decrease with the intensity of blue.
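
The two mechanisms described in the Figure 1 and Figure 2 captions can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the mixing weight `alpha`, and the choice to multiply softmax attention weights by the transferability matrix and then renormalize each row are all assumptions made for illustration; the captions only state that each sample's features are summed with those of the other samples in the batch and that the transferability graph guides attention.

```python
import math


def fuse_features(batch, alpha=1.0):
    """Add to each sample the (scaled) sum of all other samples' features.

    batch: list of equal-length feature vectors.
    alpha: hypothetical mixing weight (an assumption, not from the paper);
           alpha=1.0 is the literal "sum with all others" reading.
    """
    dim = len(batch[0])
    # Per-dimension sum over the whole batch.
    total = [sum(x[d] for x in batch) for d in range(dim)]
    # total[d] - x[d] is the sum over every sample except x itself.
    return [[x[d] + alpha * (total[d] - x[d]) for d in range(dim)] for x in batch]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def guided_attention_weights(scores, transferability):
    """Reweight attention by a transferability matrix (assumed scheme).

    scores: raw attention scores, one row per query patch.
    transferability: same shape, assumed strictly positive entries.
    Each row is softmaxed, multiplied elementwise by the transferability
    row, then renormalized so it sums to 1 again.
    """
    out = []
    for score_row, t_row in zip(scores, transferability):
        weighted = [a * t for a, t in zip(softmax(score_row), t_row)]
        norm = sum(weighted)
        out.append([w / norm for w in weighted])
    return out


if __name__ == "__main__":
    batch = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
    print(fuse_features(batch, alpha=0.5))
    print(guided_attention_weights([[0.0, 0.0]], [[1.0, 3.0]]))
```

With `alpha=1.0` every fused vector equals the batch sum, so a weight below 1 (or some other normalization) would be needed for samples to stay distinguishable. In the attention sketch, patches with higher transferability receive a proportionally larger share of each query's attention.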