Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation

Kuanghong Liu; Jin Wang; Kangjian He; Dan Xu; Xuejie Zhang

TL;DR

This work addresses low-resource, edge-friendly multi-source few-shot domain adaptation by proposing UMFDA, a decentralized schema built around uploadable, domain-specific prompts. It introduces Vision-aware Multimodal Prompt Tuning (VAMP), in which vision-aware prompts guide domain-specific text prompts so that they preserve semantic discriminability and encode domain information within the CLIP framework. VAMP trains with four losses: cross-modal semantic alignment (CSA) and domain distribution alignment (DDA) optimize each edge-side model, while text classifier consistency (TCC) and text semantic diversity (TSD) promote collaboration across edge-side models. Inference is centralized and uses averaged logits. Experiments on OfficeHome and DomainNet show VAMP outperforming prior prompt-tuning baselines in the UMFDA setting, which suggests practical potential for edge deployment and privacy-preserving learning.

Abstract

Conventional multi-source few-shot domain adaptation (MFDA) faces the challenge of further reducing the load on edge-side devices in low-resource scenarios. Given CLIP's native language supervision and the plug-and-play nature of prompts for transferring CLIP efficiently, this paper introduces an uploadable multi-source few-shot domain adaptation (UMFDA) schema. UMFDA is a form of decentralized edge collaborative learning in which the edge-side models must maintain a low computational load, and only a small portion of the source-domain data is annotated, with most of it unannotated. Further, this paper proposes a vision-aware multimodal prompt tuning framework (VAMP) under the decentralized schema, in which the vision-aware prompt guides the domain-specific text prompt to maintain semantic discriminability and perceive domain information. The cross-modal semantic alignment and domain distribution alignment losses optimize each edge-side model, while the text classifier consistency and text semantic diversity losses promote collaborative learning among edge-side models. Extensive experiments on the OfficeHome and DomainNet datasets demonstrate the effectiveness of VAMP in the UMFDA setting, where it outperforms previous prompt tuning methods.
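
The TL;DR mentions centralized inference through averaged logits. The following is a minimal sketch of what that step might look like, not the authors' implementation. It assumes each source domain contributes one text feature per class (produced by its uploaded prompt), that logits are CLIP-style scaled cosine similarities, and that the scale is 100; the function names and the toy features are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_logits(image_feat, class_text_feats, scale=100.0):
    """Scaled cosine similarity between one image and each class text feature.

    `scale` is an assumed CLIP-style logit scale, not a value from the paper.
    """
    img = l2_normalize(image_feat)
    return [
        scale * sum(a * b for a, b in zip(img, l2_normalize(text_feat)))
        for text_feat in class_text_feats
    ]


def averaged_logits(image_feat, per_domain_text_feats, scale=100.0):
    """Average the per-class logits produced by each source domain's text classifier."""
    per_domain = [
        clip_style_logits(image_feat, feats, scale) for feats in per_domain_text_feats
    ]
    return [sum(column) / len(per_domain) for column in zip(*per_domain)]


def predict(image_feat, per_domain_text_feats):
    """Return the index of the class with the highest averaged logit."""
    logits = averaged_logits(image_feat, per_domain_text_feats)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy example: two source domains, two classes, 2-D features (all values made up).
toy_image = [1.0, 0.0]
toy_domains = [
    [[1.0, 0.0], [0.0, 1.0]],  # hypothetical text features from source domain A
    [[0.8, 0.6], [0.6, 0.8]],  # hypothetical text features from source domain B
]
toy_prediction = predict(toy_image, toy_domains)
```

In this toy setup both domains score the first class higher, so the averaged logits also favor it. Averaging logits keeps the central step cheap: it needs only the per-domain text features, not the edge-side training data.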
