Unsupervised Domain Adaptation within Deep Foundation Latent Spaces

Dmitry Kangin, Plamen Angelov

TL;DR

This paper addresses unsupervised domain adaptation (UDA) without finetuning by operating in the latent spaces of foundation models. It proposes a simple prototype-based method that clusters source and target embeddings into prototypes and aligns them using $\ell^2$ or 2-Wasserstein distances, with classification via nearest-prototype matching. The results show that fixed ViT-based representations with distribution matching can outperform some finetuning-based UDA baselines on DomainNet, and that the approach affords interpretable error analysis through prototype proximity. Limitations include inconsistent gains across backbones (e.g., Dino-V2) and varying dependence on pretraining, highlighting both practical utility and areas for further improvement.

Abstract

Vision transformer-based foundation models, such as ViT or Dino-V2, are aimed at solving problems with little or no finetuning of features. Using a prototypical network setting, we analyse to what extent such foundation models can solve unsupervised domain adaptation without finetuning on the source or target domain. Through quantitative analysis, as well as qualitative interpretation of decision making, we demonstrate that the suggested method can improve upon existing baselines, and we showcase the limitations of such an approach that are yet to be solved.

Paper Structure (11 sections, 3 figures, 1 algorithm)

Figures (3)

  • Figure 1: The methodology scheme. (1) Images from multiple domains (e.g., sketches and real images) are embedded into the feature space and clustered with $k$-means, separately for each domain. The cluster centroids of one domain (the 'source domain'), shown in bright colour in the figure and referred to as 'prototypes', are provided with labels. (2) Domain adaptation is performed by matching clusters across domains using the $\ell^2$ or Wasserstein distance. (3) A nearest-neighbour prototype classifier makes the prediction.
  • Figure 2: UDA results for different backbone architectures (columns denote source domains, and rows denote target domains), $k$-means clustering ($5 \times 345=1725$ clusters, $345$ classes)
  • Figure 3: Interpretations of decision making through the closest prototypes (the leftmost image in each row is the query, and the remaining images are prototypes)
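
The three steps described for Figure 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. Only the $\ell^2$ matching variant is shown. The embeddings are synthetic stand-ins for frozen backbone features. The majority-vote labelling of source clusters, the farthest-point initialisation of $k$-means, and the rule that each target prototype inherits the label of its nearest source prototype are all assumptions made for the example. Every function name (`toy_domains`, `kmeans`, `label_prototypes`, `adapt_and_classify`) is hypothetical.

```python
import numpy as np

def pairwise_sq_l2(a, b):
    """Squared l2 distance between every row of a and every row of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def kmeans(x, k, n_iter=20):
    """Lloyd's k-means; farthest-point initialisation is an assumption."""
    centroids = [x[0]]
    for _ in range(k - 1):
        d = pairwise_sq_l2(x, np.stack(centroids)).min(axis=1)
        centroids.append(x[d.argmax()])
    centroids = np.stack(centroids).astype(float)
    for _ in range(n_iter):
        assign = pairwise_sq_l2(x, centroids).argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    assign = pairwise_sq_l2(x, centroids).argmin(axis=1)
    # Drop clusters that ended up empty and re-index the assignments.
    used, assign = np.unique(assign, return_inverse=True)
    return centroids[used], assign

def label_prototypes(assign, labels, n_protos):
    """Majority-vote label for each source cluster (assumed labelling rule)."""
    return np.array([np.bincount(labels[assign == j]).argmax()
                     for j in range(n_protos)])

def adapt_and_classify(src_x, src_y, tgt_x, queries, k):
    # (1) Cluster each domain separately; source centroids become labelled prototypes.
    src_protos, src_assign = kmeans(src_x, k)
    tgt_protos, _ = kmeans(tgt_x, k)
    src_proto_y = label_prototypes(src_assign, src_y, len(src_protos))
    # (2) l2 matching: each target prototype takes the label of its nearest source prototype.
    match = pairwise_sq_l2(tgt_protos, src_protos).argmin(axis=1)
    tgt_proto_y = src_proto_y[match]
    # (3) Nearest-prototype classification of the target queries.
    nearest = pairwise_sq_l2(queries, tgt_protos).argmin(axis=1)
    return tgt_proto_y[nearest]

def toy_domains(n_per_class=40, n_classes=3, dim=8, seed=0):
    """Synthetic embeddings: the target domain is a shifted copy of the source classes."""
    rng = np.random.default_rng(seed)
    centers = 20.0 * np.eye(n_classes, dim)
    y = np.repeat(np.arange(n_classes), n_per_class)
    src = centers[y] + 0.5 * rng.normal(size=(len(y), dim))
    tgt = centers[y] + 1.0 + 0.5 * rng.normal(size=(len(y), dim))
    return src, y, tgt
```

Calling `adapt_and_classify(src_x, src_y, tgt_x, tgt_x, k=6)` on the output of `toy_domains()` returns one predicted label per target embedding without using any target labels. Because every decision reduces to a nearest prototype, the matched prototype can be inspected directly, which is the kind of interpretation Figure 3 illustrates.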