Combining inherent knowledge of vision-language models with unsupervised domain adaptation through strong-weak guidance

Thomas Westfechtel; Dexuan Zhang; Tatsuya Harada

TL;DR

This work tackles unsupervised domain adaptation (UDA) by fusing the inherent knowledge of vision-language models, notably CLIP, with source-domain knowledge through a strong-weak guidance framework. Strong guidance expands the source domain with the most confident target samples; weak guidance is a knowledge-distillation loss on shifted zero-shot predictions. Both are trained under the joint objective $L = L_{CE} + L_{KD} + L_{AD}$, with CDAN as the base domain-adaptation loss. Further gains come from batch-norm layer adjustment, refined zero-shot processing, and optional integration with DAPL for prompt-learning-based adaptation. Experiments on OfficeHome, VisDA, and DomainNet show consistent improvements over state-of-the-art baselines; ablations highlight the contribution of each component and demonstrate compatibility with existing prompt-learning approaches.

Abstract

Unsupervised domain adaptation (UDA) tries to overcome the tedious work of labeling data by leveraging a labeled source dataset and transferring its knowledge to a similar but different target dataset. Meanwhile, current vision-language models exhibit remarkable zero-shot prediction capabilities. In this work, we combine knowledge gained through UDA with the inherent knowledge of vision-language models. We introduce a strong-weak guidance learning scheme that employs zero-shot predictions to help align the source and target datasets. For the strong guidance, we expand the source dataset with the most confident samples of the target dataset; it uses hard labels but is applied only to the most confident predictions. Conversely, the weak guidance is applied to the whole dataset but uses soft labels; it is implemented as a knowledge distillation loss with (shifted) zero-shot predictions. We show that our method complements and benefits from prompt adaptation techniques for vision-language models. We conduct experiments and ablation studies on three benchmarks (OfficeHome, VisDA, and DomainNet), outperforming state-of-the-art methods. Our ablation studies further demonstrate the contributions of the different components of our algorithm.
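The contrast between the two kinds of guidance can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `expand_source` and `distillation_loss`, the `fraction` parameter, and the toy probabilities are all hypothetical, and the distillation term is assumed to be a soft-label cross-entropy because the abstract does not state its exact form.

```python
import math

def expand_source(target_probs, fraction):
    """Strong guidance (sketch): select the most confident target samples.

    Returns (sample_index, hard_pseudo_label) pairs for the top `fraction`
    of target samples, ranked by their winning zero-shot probability.
    """
    ranked = sorted(range(len(target_probs)),
                    key=lambda i: max(target_probs[i]),
                    reverse=True)
    keep = int(len(target_probs) * fraction)
    # Each kept sample receives its arg-max class as a hard pseudo-label.
    return [(i, target_probs[i].index(max(target_probs[i])))
            for i in ranked[:keep]]

def distillation_loss(student_probs, teacher_probs, eps=1e-12):
    """Weak guidance (sketch): mean soft-label cross-entropy over ALL samples.

    Assumption: the teacher distribution is the (shifted) zero-shot prediction.
    """
    total = 0.0
    for student, teacher in zip(student_probs, teacher_probs):
        total -= sum(t * math.log(s + eps) for s, t in zip(student, teacher))
    return total / len(student_probs)

# Toy zero-shot predictions for four target samples and three classes.
zero_shot = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
]

# Hard labels, but only for the most confident half of the target samples.
pseudo_labeled = expand_source(zero_shot, fraction=0.5)
print(pseudo_labeled)
```

In this toy setting only the two confident samples receive hard pseudo-labels, while the distillation term touches every sample through its full probability vector. That is the trade-off the abstract describes: hard labels on a few samples, soft labels on all of them.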

Paper Structure (18 sections, 7 equations, 4 figures, 6 tables)

Introduction
Related works
Unsupervised domain adaptation
Vision-language models
Domain adaptation for vision-language models
Methodology
Strong guidance - Source domain expansion
Weak guidance - Knowledge distillation loss
Adversarial loss
Further improvements
Batch norm layer adjustment
Zero-shot predictions
DAPL
Experiments
Experiment settings
...and 3 more sections

Figures (4)

Figure 1: Accuracy on the OfficeHome dataset for unsupervised domain adaptation (blue), CLIP zero-shot predictions (green), and our combined version (black), which integrates zero-shot predictions into UDA. We present a way to combine the knowledge of vision-language models with knowledge transferred from a source domain via UDA; the combination significantly improves performance.
Figure 2: Process flow of our algorithm. In a first step, the zero-shot predictions for the source dataset $\mathcal{D}_s$ and the target dataset $\mathcal{D}_t$ are estimated. We shift the distribution of the zero-shot predictions through a temperature parameter in the softmax to accentuate the winning probability. Based on the zero-shot predictions, we extend the source dataset with highly confident target samples. These samples are treated as source data with their respective pseudo-labels, and they represent the strong guidance of our method. The network is then trained using a classification loss $L_{CE}$ for the (expanded) source data, a knowledge distillation loss $L_{KD}$ employing the shifted zero-shot predictions $\tilde{y}_o$, and an adversarial adaptation loss $L_{AD}$. The knowledge distillation loss represents the weak guidance of the method, as it is applied to all samples and uses the soft zero-shot predictions.
Figure 3: Accuracy on the OfficeHome dataset for different source domain expansion percentages. The green line employs only the strong guidance, while the orange line employs both strong and weak guidance. The baselines of using only adversarial domain adaptation (CDAN) and the CLIP zero-shot accuracy (ZS) are also plotted.
Figure 4: Accuracy on OfficeHome C$\rightarrow$A for different values of $\tau$. "None" denotes using the zero-shot predictions directly, without adjusting the output probabilities. Adjusting the probability distribution is essential for the knowledge distillation to work properly.
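The temperature shift mentioned in the Figure 2 caption can be sketched as a softmax whose logits are divided by a temperature before normalization. This is a minimal sketch under assumptions: the function name `shifted_softmax`, the example logits, and the temperature values are hypothetical, and the captions do not specify how the paper's $\tau$ maps to this temperature.

```python
import math

def shifted_softmax(logits, temperature):
    """Softmax with a temperature; values below 1 sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical zero-shot logits for one sample and three classes.
logits = [2.0, 1.0, 0.5]

plain = shifted_softmax(logits, temperature=1.0)   # unadjusted prediction
sharp = shifted_softmax(logits, temperature=0.25)  # accentuated winner

print(max(plain), max(sharp))
```

Lowering the temperature leaves the predicted class unchanged but moves probability mass onto it, which is one way to accentuate the winning probability before the predictions are used as soft distillation targets.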
