Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Xinyao Li; Yuke Li; Zhekai Du; Fengling Li; Ke Lu; Jingjing Li

TL;DR

UniMoS addresses unsupervised domain adaptation (UDA) for vision-language models by explicitly disentangling CLIP's vision features into language-associated components (LAC) and vision-associated components (VAC), enabling joint yet modality-specific adaptation. It introduces Modality-Ensemble Training (MET) and a modality discriminator to align LAC and VAC across domains while preserving pretrained semantics. Because CLIP's parameters are kept frozen rather than fine-tuned, it achieves state-of-the-art or competitive results at low computational cost. Key contributions are revealing the modality gap's impact on UDA, proposing a practical modality separation framework, and demonstrating strong performance on Office-Home, VisDA-2017, DomainNet, and Mini-DomainNet, supported by ablations and efficiency analyses. The approach offers a scalable, data-efficient way to use multimodal priors in UDA when deploying VLMs on cross-domain tasks.

Abstract

Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet, most transfer approaches for VLMs focus on either the language or visual branches, overlooking the nuanced interplay between both modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS
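The split-then-merge idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the separation networks are plain linear maps, that the language branch scores classes by cosine similarity to text embeddings, that the vision branch uses a linear classifier, and that the two outputs are mixed with a fixed weight `w`. All names (`matvec`, `lac_probs`, `ensemble`, `W_lac`, `W_vac`, `W_cls`) and the random toy data are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def l2_normalize(v):
    """Scale v to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lac_probs(f_lac, text_feats, temperature=0.01):
    """Language branch (assumed): cosine similarity to class text embeddings."""
    q = l2_normalize(f_lac)
    logits = [
        sum(a * b for a, b in zip(q, l2_normalize(t))) / temperature
        for t in text_feats
    ]
    return softmax(logits)


def ensemble(p_lac, p_vac, w):
    """Merge the two modality outputs with a mixing weight w in [0, 1]."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_lac, p_vac)]


# Toy data standing in for frozen CLIP outputs (random, for illustration only).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


dim, vac_dim, n_classes = 8, 4, 3
f_clip = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # one CLIP vision feature
text_feats = rand_matrix(n_classes, dim)            # one text embedding per class

W_lac = rand_matrix(dim, dim)            # hypothetical LAC separator
W_vac = rand_matrix(vac_dim, dim)        # hypothetical VAC separator
W_cls = rand_matrix(n_classes, vac_dim)  # hypothetical VAC classifier

f_lac = matvec(W_lac, f_clip)            # split: language-associated component
f_vac = matvec(W_vac, f_clip)            # split: vision-associated component

p_lac = lac_probs(f_lac, text_feats)
p_vac = softmax(matvec(W_cls, f_vac))
p_mix = ensemble(p_lac, p_vac, w=0.5)    # merge the two predictions
print(p_mix)
```

In this sketch only the three weight matrices would be trained; the CLIP features and text embeddings are inputs, which mirrors the low-cost setup in which CLIP itself is not fine-tuned.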

Paper Structure (16 sections, 25 equations, 6 figures, 10 tables, 1 algorithm)

Introduction
Related work
Method
Problem formulation
Modality separation networks
Modality-ensemble training
Aligning source and target by discriminator
Training and inference
Experiments
Datasets and implementation details
Benchmark results
Ablation study
Discussions
Conclusions
Training algorithm
...and 1 more section

Figures (6)

Figure 1: Examples of modality-specific information from the Art$\rightarrow$RealWorld task in the Office-Home dataset. The numbers are the top-2 classification probabilities given by both modalities.
Figure 2: Framework of our method. We freeze the pretrained vision and text encoders of CLIP. CLIP-extracted vision features are disentangled into language-associated components ($f_{lac}$) and vision-associated components ($f_{vac}$) by the modality separation networks. We obtain zero-shot results from CLIP as teacher knowledge and distill this knowledge to LAC. We then introduce a weight generator to assemble the modality outputs to train VAC. A modality discriminator is applied to align LAC and VAC from both domains.
Figure 3: Effects of learnable ensemble weight $w$ on Office-Home.
Figure 4: t-SNE visualization (van der Maaten and Hinton, 2008) of the effects of UniMoS on the A$\to$P task from Office-Home. UniMoS effectively disentangles CLIP-extracted vision features into LAC and VAC (shown for 25 randomly selected classes) and constructs clear cross-domain locality structures.
Figure 5: Parameter sensitivity analysis on $\alpha$ and $\beta$ of UniMoS.
...and 1 more figure
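The Figure 2 caption describes distilling CLIP's zero-shot predictions (the teacher) into the LAC branch. The excerpt here does not give the loss, so the sketch below assumes one common choice, the KL divergence between teacher and student class distributions. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student); eps guards against log(0)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(p_teacher, p_student)
    )


# Teacher: CLIP zero-shot class probabilities (made-up logits).
teacher = softmax([2.0, 0.5, -1.0])

# Two hypothetical LAC outputs: an untrained one and one close to the teacher.
student_untrained = softmax([0.0, 0.0, 0.0])
student_trained = softmax([1.9, 0.6, -1.0])

loss_untrained = kl_divergence(teacher, student_untrained)
loss_trained = kl_divergence(teacher, student_trained)
print(loss_untrained, loss_trained)
```

The loss shrinks as the LAC output approaches the teacher distribution and is zero when they match, which is the behaviour a distillation objective needs regardless of the exact form used in the paper.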
