Unsupervised Domain Adaptation within Deep Foundation Latent Spaces
Dmitry Kangin, Plamen Angelov
TL;DR
This paper addresses unsupervised domain adaptation (UDA) without finetuning by operating in the latent spaces of foundation models. It proposes a simple prototype-based method that clusters source and target embeddings into prototypes, aligns them using $\ell^2$ or 2-Wasserstein distances, and classifies by nearest-prototype matching. The results show that fixed ViT-based representations with distribution matching can outperform some finetuning-based UDA baselines on DomainNet, and the approach supports interpretable error analysis through prototype proximity. Limitations include inconsistent gains across backbones (e.g., DinoV2) and varying dependence on pretraining, which highlights both the method's practical utility and areas for further improvement.
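The prototype pipeline summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that embeddings have already been extracted from a frozen backbone, that source prototypes are plain per-class means, that target prototypes come from a basic k-means, and that each target prototype inherits the label of its nearest source prototype under the $\ell^2$ distance. All function names (`class_prototypes`, `kmeans`, `predict_target`, `w2_diag_gaussians`) are hypothetical. The last helper gives the closed-form 2-Wasserstein distance between Gaussians with diagonal covariance, which is one possible way to compare prototypes that carry a spread as well as a mean.

```python
import numpy as np


def l2_distances(a, b):
    """Pairwise Euclidean distances between rows of a (n, d) and b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def class_prototypes(embeddings, labels):
    """Assumption: one source prototype per class, the mean embedding."""
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return protos, classes


def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on unlabeled embeddings; returns (k, d) centroids."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]  # copy of k rows
    for _ in range(n_iter):
        assign = l2_distances(x, centroids).argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def w2_diag_gaussians(mean1, std1, mean2, std2):
    """2-Wasserstein distance between two diagonal-covariance Gaussians."""
    mean_term = np.sum((np.asarray(mean1) - np.asarray(mean2)) ** 2)
    std_term = np.sum((np.asarray(std1) - np.asarray(std2)) ** 2)
    return float(np.sqrt(mean_term + std_term))


def predict_target(source_emb, source_labels, target_emb, k_target):
    """Label target samples via nearest-prototype matching (l2 variant)."""
    src_protos, classes = class_prototypes(source_emb, source_labels)
    tgt_protos = kmeans(target_emb, k_target)
    # Each target prototype takes the class of its nearest source prototype.
    proto_labels = classes[l2_distances(tgt_protos, src_protos).argmin(axis=1)]
    # Each target sample takes the label of its nearest target prototype.
    nearest = l2_distances(target_emb, tgt_protos).argmin(axis=1)
    return proto_labels[nearest]


# Toy usage: two well-separated classes, target domain shifted by +1.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
source_labels = np.repeat([0, 1], 50)
source_emb = centers[source_labels] + 0.5 * rng.standard_normal((100, 2))
target_labels = np.repeat([0, 1], 50)
target_emb = centers[target_labels] + 1.0 + 0.5 * rng.standard_normal((100, 2))

pred = predict_target(source_emb, source_labels, target_emb, k_target=2)
print("toy target accuracy:", (pred == target_labels).mean())
```

In this sketch no model weights are updated at any point; only the prototypes depend on the data, which is what makes each prediction traceable to a specific pair of source and target prototypes.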
Abstract
Vision transformer-based foundation models, such as ViT or Dino-V2, aim to solve problems with little or no finetuning of features. Using a prototypical-network setting, we analyse to what extent such foundation models can solve unsupervised domain adaptation without finetuning on either the source or the target domain. Through quantitative analysis and qualitative interpretation of the decision making, we demonstrate that the suggested method can improve upon existing baselines, and we showcase limitations of the approach that remain to be solved.
