Vision transformers in domain adaptation and domain generalization: a study of robustness

Shadi Alijani; Jamil Fayyad; Homayoun Najjaran

TL;DR

This paper investigates the deployment of vision transformers in domain adaptation and domain generalization scenarios. It categorizes domain adaptation research into feature-level, instance-level, model-level, and hybrid approaches, along with further categories covering diverse strategies for enhancing domain adaptation.

Abstract

Deep learning models are often evaluated in scenarios where the data distribution differs from the one used during training and validation. This discrepancy makes it difficult to predict how a model will perform once deployed on the target distribution. Domain adaptation and domain generalization are widely recognized as effective strategies for addressing such shifts and thereby ensuring reliable performance. Recent promising results from applying vision transformers to computer vision tasks, coupled with advancements in self-attention mechanisms, have demonstrated their significant potential for robustness and generalization under distribution shift. Motivated by the increased interest from the research community, our paper investigates the deployment of vision transformers in domain adaptation and domain generalization scenarios. For domain adaptation, we categorize research into feature-level, instance-level, model-level, and hybrid approaches, along with further categories covering diverse strategies for enhancing domain adaptation. Similarly, for domain generalization, we categorize research into multi-domain learning, meta-learning, regularization techniques, and data augmentation strategies. We further classify the diverse strategies found in the literature, underscoring the various approaches researchers have taken to address distribution shifts by integrating vision transformers. Comprehensive tables summarizing these categories are a distinct feature of our work and offer valuable insights for researchers. These findings highlight the versatility of vision transformers in managing distribution shifts, which is crucial for real-world applications, especially in safety-critical and decision-making scenarios.

Paper Structure (29 sections, 3 equations, 10 figures)

Introduction
Vision Transformers: Fundamentals and Architecture
Overview of the Vision Transformers Architecture
Key Components and Building Blocks of Vision Transformers
Training process of Vision Transformers
Advantages of Vision Transformers compared to CNNs backbones
Vision Transformers in Domain Adaptation and Domain Generalization
Vision Transformers in Domain Adaptation
Feature-Level Adaptation
Instance-Level Adaptation
Model-Level Adaptation
Hybrid Approaches
Diverse Strategies for Enhancing Domain Adaptation
Vision Transformers in Domain Generalization
Multi-Domain Learning
...and 14 more sections

Figures (10)

Figure 1: Factors that affect the robustness of deep learning models: (a) the original image, (b) severe occlusions, (c) adversarial perturbations, (d) patch permutations, and (e) distributional shifts, such as stylization to remove texture cues.
Figure 2: (a) An image is divided into fixed-size patches, each patch is linearly embedded, and position embeddings are added. The resulting sequence of vectors is then fed into a standard Transformer encoder. For classification, an additional learnable classification token is added to the sequence. (b) The Transformer architecture uses stacked self-attention and point-wise, fully connected layers in both its encoder and decoder, shown in the left and right halves of the figure, respectively.
Figure 3: Schematic representation of the Scaled Dot-Product Attention and Multi-Head Attention mechanisms. The top process combines queries, keys, and values to compute attention scores, while the bottom shows parallel attention layers merging in Multi-Head Attention, a core feature of transformer models for capturing varied contextual cues. The depiction of the attention mechanism is inspired by khan2022transformers.
Figure 4: Our categorization of studies on adapting vision transformers to handle distribution shifts in domain adaptation and domain generalization approaches.
Figure 5: Representative Works of ViTs for DA
...and 5 more figures
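
The pipeline described in the captions of Figure 2(a) and Figure 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The image size, patch size, embedding width, random weights, and all function names (patchify, embed_patches, scaled_dot_product_attention) are assumptions chosen for illustration. Placing the classification token at the front of the sequence follows the standard ViT convention, and only a single attention head is shown.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches, w_embed, cls_token, pos_embed):
    """Linearly embed patches, prepend the class token, add position embeddings."""
    tokens = patches @ w_embed                # (N, D)
    tokens = np.vstack([cls_token, tokens])   # (N + 1, D)
    return tokens + pos_embed

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
patch_size, dim = 8, 16

patches = patchify(image, patch_size)  # 16 patches of length 192
w_embed = rng.standard_normal((patches.shape[1], dim)) * 0.02
cls_token = np.zeros((1, dim))
pos_embed = rng.standard_normal((patches.shape[0] + 1, dim)) * 0.02
tokens = embed_patches(patches, w_embed, cls_token, pos_embed)

# Random projections stand in for the learned query, key, and value weights.
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))
out, weights = scaled_dot_product_attention(tokens @ w_q, tokens @ w_k, tokens @ w_v)
```

In a trained model the projection matrices, class token, and position embeddings are learned parameters, and several such heads run in parallel before their outputs are concatenated, as the Multi-Head Attention part of Figure 3 indicates.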