Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation

Kuanghong Liu, Jin Wang, Kangjian He, Dan Xu, Xuejie Zhang

TL;DR

This work addresses the challenge of low-resource, edge-friendly multi-source few-shot domain adaptation by proposing UMFDA, a decentralized schema that enables uploadable, domain-specific prompts. It introduces Vision-aware Multimodal Prompt Tuning (VAMP), where vision prompts guide domain-specific text prompts to preserve semantic discriminability and encode domain information within the CLIP framework. VAMP is trained in a decentralized manner with four losses: cross-modal semantic alignment (CSA), domain distribution alignment (DDA), text classifier consistency (TCC), and text semantic diversity (TSD). CSA and DDA optimize each edge-side model, while TCC and TSD foster collaboration across edge-side models; inference is centralized and averages the logits of the edge-side models. Experiments on OfficeHome and DomainNet show VAMP outperforming prior prompt-tuning baselines in the UMFDA setting, suggesting practical potential for edge deployment and privacy-preserving learning.
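The centralized inference step mentioned above (averaging logits across edge-side models) can be illustrated with a minimal sketch. It assumes each edge-side model has already produced one logit per class for the same target image; how those logits are computed (CLIP with the uploaded prompts) is not modeled here, and the function names `average_logits` and `predict` are hypothetical, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(per_model_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the class logits from several models.

    Assumption: every model scores the same classes in the same order.
    """
    if not per_model_logits:
        raise ValueError("need logits from at least one model")
    n_classes = len(per_model_logits[0])
    if any(len(row) != n_classes for row in per_model_logits):
        raise ValueError("all models must score the same number of classes")
    n_models = len(per_model_logits)
    return [
        sum(row[c] for row in per_model_logits) / n_models
        for c in range(n_classes)
    ]


def predict(per_model_logits: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the predicted class index and class probabilities."""
    probs = softmax(average_logits(per_model_logits))
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


if __name__ == "__main__":
    # Two hypothetical edge-side models scoring three classes.
    logits = [[2.0, 1.0, 0.0], [0.5, 2.5, 0.5]]
    label, probs = predict(logits)
    print(label, probs)
```

In this toy example the two models disagree on the top class, and the averaged logits decide the prediction. Whether the paper averages raw logits exactly like this, or applies scaling first, is not stated in this summary.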

Abstract

Conventional multi-source few-shot domain adaptation (MFDA) faces the challenge of further reducing the load on edge-side devices in low-resource scenarios. Considering the native language-supervised advantage of CLIP and the plug-and-play nature of prompts for transferring CLIP efficiently, this paper introduces an uploadable multi-source few-shot domain adaptation (UMFDA) schema. It is a form of decentralized edge collaborative learning in which the edge-side models must maintain a low computational load, and only a limited number of annotations are provided in the source domain data, with most of the data being unannotated. Further, this paper proposes a vision-aware multimodal prompt tuning framework (VAMP) under the decentralized schema, where the vision-aware prompt guides the domain-specific text prompt to maintain semantic discriminability and perceive the domain information. The cross-modal semantic and domain distribution alignment losses optimize each edge-side model, while the text classifier consistency and semantic diversity losses promote collaborative learning among edge-side models. Extensive experiments on the OfficeHome and DomainNet datasets demonstrate the effectiveness of VAMP in the UMFDA setting, where it outperforms previous prompt tuning methods.

Paper Structure

This paper contains 19 sections, 16 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: The illustration of uploadable multi-source few-shot domain adaptation (UMFDA) schema for decentralized edge learning.
  • Figure 2: Summary of various prompt tuning technologies (best viewed in color). (a) summarizes several prevalent prompt tuning methods, which are domain-agnostic. (b) represents typical prompt tuning methods for single-source domain adaptation, which focus on disentangling the prompts to explore the difference between the source and target domains; with multiple source domains, they must be further aligned on the central device. (c) is our proposed vision-aware multimodal prompt tuning method tailored for UMFDA.
  • Figure 3: (a) Conceptual diagram of domain alignment. The first approach (top), aligning multiple source domains with the target domain, is unsuitable for the decentralized edge computing scenario because the centralized model needs additional alignment training. It is also hard to match all the distributions of the source and target domains. The second approach, pair-wise alignment between source and target domains, inspired by Zhu et al. [Zhu2019], is thus adopted by the VAMP framework. (b) The proposed decentralized training framework VAMP. For clarity, only two edge-side models are drawn here.
  • Figure 4: PCA visualizations of the image features extracted from the source and target domains by different domain-specific models of "APR→C". The first row shows prompts projected from text to vision; the second row shows the vision-to-text projection used in VAMP. Contour lines enclose regions with a high density of data points.
  • Figure 5: t-SNE visualization of the image and text features of the target domain extracted by the "Clipart-Real World" models of VAMP and DAPL. The statistics of the intra-class visual variance and the inter-class text variance are shown at the top of each subfigure.
  • ...and 1 more figures