Learning to Prompt Your Domain for Vision-Language Models

Guoyizhe Wei; Feng Wang; Anshul Shah; Rama Chellappa

TL;DR

This paper tackles domain shift in federated learning by applying domain-aware prompt learning to vision–language models. It introduces Fed-DPT, a dual-prompt framework that combines domain-specific textual prompts with domain-aware visual prompts, fused via attention, while keeping the CLIP encoders frozen for efficiency. Each client optimizes the text prompt for its own domain locally and receives the other domains' text prompts with momentum updates to stabilize training, while visual prompts are shared across clients via FedAvg. Across DomainNet, OfficeHome, and PACS, Fed-DPT achieves state-of-the-art average accuracy (e.g., $68.4\%$ on DomainNet, $14.8\%$ above CLIP), with significantly reduced communication costs and strong few-shot robustness, illustrating the practicality of domain-aware prompt learning in federated vision–language tasks.

Abstract

Prompt learning has recently become a very efficient transfer learning paradigm for Contrastive Language-Image Pretraining (CLIP) models. Compared with fine-tuning the entire encoder, prompt learning can obtain highly competitive results by optimizing only a small number of parameters, which is a considerable benefit for federated learning applications that prioritize communication efficiency. However, in this work, we identify that directly transferring prompt learning approaches to federated learning does not yield favorable results, since the model often suffers from considerable domain gaps across different clients. To address this issue, we propose ADAPT, a novel domain-aware prompt learning approach that facilitates both intra- and inter-domain prompts across federated participants. The basic idea of ADAPT is that the prompted CLIP should detect the input image's domain correspondence before predicting its category. Extensive experiments demonstrate that ADAPT is both efficient and effective in federated learning. For example, by learning and sharing only 0.08M parameters, ADAPT attains a 68.4% average accuracy over six domains in the DomainNet dataset, improving on the original CLIP by a large margin of 14.8%.

Paper Structure (15 sections, 10 equations, 6 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Contrastive Language-Image Models
Prompt Tuning for Vision and Language
Methodology
Problem Formulation
Local training
Parameters Aggregation
Experiments
Experimental Setup
Main Results
Privacy and Communication Cost
Ablation Studies
Conclusion

Figures (6)

Figure 1: Local training framework. We load a pre-trained CLIP model and freeze both its image and text encoders. For each client, the text encoder receives $n$ text prompts followed by class names: one prompt is optimized by gradients, and the remaining $n-1$ are loaded from other clients with a momentum update. The image encoder receives $n$ learnable prompt tokens followed by patch-wise embedded images; these prompt tokens are optimized by gradients.
Figure 2: Parameter aggregation pipeline of Fed-DPT. We aggregate textual prompts by concatenating the domain-specific tokens from each client, and aggregate visual prompts by averaging.
Figure 3: Comparison of performance to fine-tuning protocols on the DomainNet dataset.
Figure 4: Comparison of parameters and accuracy (%) to fine-tuning protocols on the DomainNet dataset. Our results are marked in blue. The best results in each domain are bolded.
Figure 5: Comparison of convergence.
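
The captions of Figures 1 and 2 describe a simple data flow that can be sketched in plain Python. The sketch below is illustrative only and is not the authors' implementation: every name (`PromptSlot`, `build_prompt_bank`, `momentum_refresh`, `concat_text_prompts`, `average_visual_prompts`) is hypothetical, prompts are plain lists of floats rather than tensors, and the momentum rule is assumed to be an exponential moving average, since the exact formula is not given here. The optional `weights` argument is likewise an assumption; with no weights, the visual prompts are averaged uniformly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class PromptSlot:
    """One text-prompt token held by a client (hypothetical container)."""
    domain: str
    token: Vector
    trainable: bool  # True only for the client's own domain


def build_prompt_bank(own_domain: str, own_token: Vector,
                      shared: Dict[str, Vector]) -> List[PromptSlot]:
    """Own prompt first (trainable), then the other domains' prompts (frozen)."""
    bank = [PromptSlot(own_domain, list(own_token), True)]
    for domain in sorted(shared):
        if domain != own_domain:
            bank.append(PromptSlot(domain, list(shared[domain]), False))
    return bank


def momentum_refresh(bank: List[PromptSlot], incoming: Dict[str, Vector],
                     alpha: float = 0.9) -> None:
    """Blend newly received prompts into the frozen slots, in place.

    ASSUMPTION: exponential moving average, alpha * old + (1 - alpha) * new.
    """
    for slot in bank:
        if slot.trainable or slot.domain not in incoming:
            continue
        slot.token = [alpha * old + (1.0 - alpha) * new
                      for old, new in zip(slot.token, incoming[slot.domain])]


def concat_text_prompts(client_tokens: Dict[str, Vector]) -> Dict[str, Vector]:
    """Server side: keep every client's domain token side by side (no averaging)."""
    return {domain: list(tok) for domain, tok in client_tokens.items()}


def average_visual_prompts(client_prompts: List[List[Vector]],
                           weights: Optional[List[float]] = None) -> List[Vector]:
    """Server side: element-wise (optionally weighted) mean of visual prompt tokens."""
    if weights is None:
        weights = [1.0] * len(client_prompts)
    total = sum(weights)
    n_tokens, dim = len(client_prompts[0]), len(client_prompts[0][0])
    return [[sum(w * p[t][d] for w, p in zip(weights, client_prompts)) / total
             for d in range(dim)]
            for t in range(n_tokens)]
```

The two aggregation paths are kept as separate functions to mirror the asymmetry in the captions: text prompts stay domain-specific and are only collected, whereas visual prompts are merged into a single shared set.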
