EUDA: An Efficient Unsupervised Domain Adaptation via Self-Supervised Vision Transformer

Ali Abedi, Q. M. Jonathan Wu, Ning Zhang, Farhad Pourpanah

TL;DR

This paper tackles the inefficiency of state-of-the-art unsupervised domain adaptation methods by proposing EUDA, which employs a frozen DINOv2 self-supervised vision transformer as a feature extractor and a compact fully connected bottleneck. It introduces the Synergistic Domain Alignment Loss (SDAL), a weighted combination of cross-entropy and maximum mean discrepancy losses, to jointly minimize source classification errors and align source-target feature distributions. Empirical results across Office-31, Office-Home, VisDA-2017, and DomainNet show that EUDA achieves competitive or superior accuracy while dramatically reducing trainable parameters (up to 99.7% fewer in DomainNet), highlighting strong potential for resource-constrained settings. The work demonstrates the practicality of using self-supervised ViT backbones for efficient domain adaptation and suggests broader applications in on-edge environments and safety-critical domains.

Abstract

Unsupervised domain adaptation (UDA) aims to mitigate the domain shift issue, where the distribution of training (source) data differs from that of testing (target) data. Many models have been developed to tackle this problem, and recently vision transformers (ViTs) have shown promising results. However, the complexity and large number of trainable parameters of ViTs restrict their deployment in practical applications. This underscores the need for an efficient model that not only reduces trainable parameters but also allows for adjustable complexity based on specific needs while delivering comparable performance. To achieve this, in this paper we introduce an Efficient Unsupervised Domain Adaptation (EUDA) framework. EUDA employs DINOv2, a self-supervised ViT, as a feature extractor, followed by a simplified bottleneck of fully connected layers that refines features for enhanced domain adaptation. Additionally, EUDA employs the synergistic domain alignment loss (SDAL), which integrates cross-entropy (CE) and maximum mean discrepancy (MMD) losses, to balance adaptation by minimizing classification errors in the source domain while aligning the source and target domain distributions. The experimental results indicate that EUDA produces results comparable to those of other state-of-the-art domain adaptation methods with significantly fewer trainable parameters, between 42% and 99.7% fewer. This showcases the ability to train the model in a resource-limited environment. The code of the model is available at: https://github.com/A-Abedi/EUDA.
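To make the idea of combining CE and MMD concrete, the following is a minimal sketch of such a loss in plain Python. It is not the authors' implementation: the additive form `ce + lam * mmd`, the weight name `lam`, the single Gaussian kernel, and its bandwidth `sigma` are all assumptions chosen for illustration, and the paper's exact weighting and kernel may differ.

```python
import math

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of labeled source samples."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)

def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_k(a, b):
        return sum(gaussian_kernel(x, y, sigma) for x in a for y in b) / (len(a) * len(b))
    return mean_k(source, source) + mean_k(target, target) - 2.0 * mean_k(source, target)

def sdal(src_logits, src_labels, src_feats, tgt_feats, lam=0.5, sigma=1.0):
    """Assumed form: source CE plus lam times the source-target MMD term."""
    ce = cross_entropy(src_logits, src_labels)
    return ce + lam * mmd2(src_feats, tgt_feats, sigma)

# Toy usage: two source samples with labels, two unlabeled target samples.
src_feats = [[0.0, 1.0], [1.0, 0.0]]
tgt_feats = [[3.0, 4.0], [4.0, 3.0]]
loss = sdal([[2.0, 0.0], [0.0, 2.0]], [0, 1], src_feats, tgt_feats)
```

Only the source batch contributes to the CE term, since target labels are unavailable; the target batch enters solely through the MMD term, which shrinks toward zero as the two feature distributions come to match.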

Paper Structure (22 sections, 3 equations, 4 figures, 9 tables, 1 algorithm)

Figures (4)

  • Figure 1: Conceptual diagram representing the impact of the SDAL in aligning the data distributions of the source and target domains while also enhancing classification accuracy. By integrating MMD with CE loss, SDAL effectively minimizes domain discrepancies and optimizes classification outcomes in a simulated environment.
  • Figure 2: The architecture of the proposed EUDA model. The process begins with a pre-trained self-supervised ViT extracting features from both the labeled source domain and the unlabeled target domain. The extracted features pass through a bottleneck consisting of several fully connected layers. The bottleneck output is used in two ways: to compute the MMD loss and as input to the classification head. The classification results on the source domain are then used to calculate the CE component of the SDAL, which combines MMD loss and CE loss to train the model under unsupervised domain adaptation conditions.
  • Figure 3: Self-distillation with no labels. Image from Caron et al. (2021).
  • Figure 4: Attention maps of the pre-trained DINOv2 base model on an Alarm Clock from the four domains of the Office-Home dataset: Art, Clipart, Product, and Real World. This illustrates the model's robust feature extraction capability across diverse image contexts without any fine-tuning.
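The architecture in Figure 2 keeps the ViT backbone frozen and trains only the fully connected bottleneck and classification head, which is where the parameter savings come from. The sketch below counts trainable parameters for such a stack under stated assumptions: the 768-dimensional feature width and the roughly 86 million backbone parameters correspond to a DINOv2 ViT-B model, 65 is the number of Office-Home classes, and the single 256-unit bottleneck layer is a hypothetical width, not the paper's configuration.

```python
def fc_param_count(dims):
    """Weights plus biases for a stack of fully connected layers.

    dims lists layer widths in order, e.g. [input, hidden, ..., output].
    """
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

def trainable_fraction(frozen_params, dims):
    """Share of all parameters that are trained when the backbone is frozen."""
    trainable = fc_param_count(dims)
    return trainable / (frozen_params + trainable)

FEATURE_DIM = 768              # DINOv2 ViT-B feature width
BOTTLENECK_DIM = 256           # hypothetical bottleneck width (assumption)
NUM_CLASSES = 65               # Office-Home categories
FROZEN_BACKBONE = 86_000_000   # approximate ViT-B parameter count

dims = [FEATURE_DIM, BOTTLENECK_DIM, NUM_CLASSES]
trainable = fc_param_count(dims)
fraction = trainable_fraction(FROZEN_BACKBONE, dims)
print(f"trainable parameters: {trainable}")
print(f"fraction of total:    {fraction:.4%}")
```

Widening or deepening `dims` is the knob that adjusts model complexity in this sketch; the frozen backbone's cost stays fixed regardless.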