Unsupervised Domain Adaption Harnessing Vision-Language Pre-training

Wenlve Zhou; Zhiheng Zhou

TL;DR

This work tackles unsupervised domain adaptation by exploiting large-scale vision-language pre-training. It introduces Cross-Modal Knowledge Distillation (CMKD), which uses a vision-language model's text encoder as a teacher to guide target-domain learning, and Residual Sparse Training (RST), a brain-inspired, parameter-efficient fine-tuning strategy that dramatically reduces deployment parameters without large performance loss. CMKD can be paired with existing UDA methods (e.g., FixMatch) and demonstrates state-of-the-art results across diverse benchmarks, while RST achieves substantial storage savings (down to approximately 0.1\%$\sim$0.5\% of downstream parameters) with minimal accuracy loss. Together, these methods enable effective, scalable, and storage-efficient UDA using vision-language pre-trained models across CNN and ViT backbones, with robust ablations and extensive benchmark results supporting their efficacy.

Abstract

This paper addresses two challenges in Unsupervised Domain Adaptation (UDA), with a focus on harnessing Vision-Language Pre-training (VLP) models. First, UDA has primarily relied on ImageNet pre-trained models, and the potential of VLP models in UDA remains largely unexplored, even though their rich representations hold significant promise for UDA tasks. To address this, we propose Cross-Modal Knowledge Distillation (CMKD), which leverages VLP models as teacher models to guide learning in the target domain and achieves state-of-the-art performance. Second, current UDA paradigms train a separate model for each task, which leads to significant storage overhead and makes model deployment impractical as the number of transfer tasks grows. To overcome this, we introduce Residual Sparse Training (RST), a technique that exploits the benefits of VLP's extensive pre-training and requires adjusting only a small fraction (approximately 0.1\%$\sim$0.5\%) of the VLP model parameters to achieve performance comparable to fine-tuning. Combining CMKD and RST, we present a comprehensive solution that leverages VLP models for UDA tasks while reducing the storage overhead of model deployment. Furthermore, CMKD can serve as a baseline in conjunction with other methods such as FixMatch, further improving UDA performance. Our proposed method outperforms existing techniques on standard benchmarks. Our code will be available at: https://github.com/Wenlve-Zhou/VLP-UDA.
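The teacher-student idea behind CMKD can be illustrated with a generic knowledge-distillation loss. This is a minimal sketch and not the paper's actual CMKD objective, which this page does not spell out. It assumes the teacher signal is a vector of per-class scores (for example, image-text similarities produced with the VLP text encoder) and that the student is the vision classifier being adapted. The names `softmax`, `distillation_loss`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of per-class scores."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for one unlabeled target-domain sample.

    Assumption: teacher_logits come from the VLP model (e.g. image-text
    similarity scores); student_logits come from the adapted classifier.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )


# Hypothetical scores for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
student_far = [0.0, 2.0, 0.0]    # disagrees with the teacher
student_near = [3.5, 1.0, 0.5]   # mostly agrees with the teacher

print(distillation_loss(student_far, teacher))
print(distillation_loss(student_near, teacher))
```

A student that agrees with the teacher incurs a smaller loss, so minimizing this term on target data pulls the student toward the teacher's predictions. Because the target domain is unlabeled, no ground-truth label appears in the loss.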

Paper Structure (16 sections, 17 equations, 4 figures, 10 tables, 2 algorithms)

Introduction
Related Work
Visual-Language Pre-training
Unsupervised Domain Adaption
Knowledge Distillation
Parameter Efficient Fine-tuning
Methods
Preliminary
Cross-Modal Knowledge Distillation
Residual Sparse Training
Experiments
Setup
Comparison with SoTA UDA Methods
Comparison with SoTA PEFT Methods
Ablation Study
...and 1 more sections

Figures (4)

Figure 1: The bar chart displays the average accuracy of our method and previous State-Of-The-Art (SoTA) methods on popular benchmarks. The statistics on each bar represent the Downstream Parameters (DSP) of the respective method (DSP is defined in Section IV). Combining CMKD with RST leads to a substantial reduction in model storage overhead, while combining it with FixMatch [74] further improves model performance. CLIP [8] represents the zero-shot inference performance on the target domain.
Figure 2: Overview of VLP model tuning and UDA deployment. (a) VLP model tuning. Previous methods struggle to balance the use of general knowledge against domain-specific representation. We introduce cross-modal knowledge distillation to reconcile the two, and remove the text encoder at inference. (b) UDA deployment. Conventional UDA deployment selects full-parameter domain weights based on the application scenario. Our method learns a highly sparse weight (approximately 0.1%$\sim$0.5% of the downstream model's parameters) that can be added to the pre-trained model for deployment.
Figure 3: The pipeline of cross-modal knowledge distillation. CMKD is exclusively utilized for target data.
Figure 4: Average accuracy and DSP comparison under different predefined thresholds on Office-Home. LP and FT stand for Linear-Probe and Fine-Tuning, respectively.
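
The deployment scheme in the Figure 2(b) caption, together with the predefined threshold in the Figure 4 caption, can be sketched as storing only the large entries of the difference between fine-tuned and pre-trained weights. This is a minimal sketch under assumptions: this page does not state the exact thresholding rule or when it is applied during training, so magnitude thresholding of the residual is used here as one plausible reading. The helpers `sparsify_residual`, `deploy`, and `stored_fraction` are hypothetical names, and `stored_fraction` is only a rough stand-in for storage cost, not the paper's DSP definition.

```python
def sparsify_residual(pretrained, finetuned, threshold):
    """Keep only residual entries whose magnitude reaches the threshold.

    Assumption: the residual is (finetuned - pretrained), and entries
    below the threshold are dropped. Returns a {index: delta} mapping.
    """
    residual = {}
    for index, (p, f) in enumerate(zip(pretrained, finetuned)):
        delta = f - p
        if abs(delta) >= threshold:
            residual[index] = delta
    return residual


def deploy(pretrained, sparse_residual):
    """Rebuild task weights by adding the sparse residual to the shared weights."""
    weights = list(pretrained)
    for index, delta in sparse_residual.items():
        weights[index] += delta
    return weights


def stored_fraction(sparse_residual, num_params):
    """Fraction of parameters that must be stored per task."""
    return len(sparse_residual) / num_params


# Hypothetical flattened weights for a tiny 4-parameter model.
pretrained = [0.5, -1.0, 0.25, 2.0]
finetuned = [0.5001, -1.0, 0.75, 2.0]

residual = sparsify_residual(pretrained, finetuned, threshold=0.01)
task_weights = deploy(pretrained, residual)
print(residual, task_weights, stored_fraction(residual, len(pretrained)))
```

Only the sparse residual needs to be stored per task; the pre-trained weights are shared across all transfer tasks. Raising the threshold stores fewer entries at the cost of drifting further from the fine-tuned weights, which is the trade-off Figure 4 reports.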
