Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models

Jun Luo, Chen Chen, Shandong Wu

TL;DR

The paper addresses the challenge of adapting CLIP-like vision–language models in federated settings where communicating large models is costly, by introducing pFedMoAP, a personalized federated mixture of adaptive prompts. Clients download multiple pre-aggregated prompts as non-local experts and use a client-specific attention-based gating network to fuse local and non-local prompt knowledge, enabling effective MoE-based personalization with minimal overhead. The gating network operates on a reduced feature space ($d_{gating}=128$) and supports a flexible number of experts, while a KNN-based strategy selects non-local prompts from a server pool. Empirical results across 9 datasets under diverse non-IID settings show substantial improvements over state-of-the-art federated prompt methods, including robustness to feature and label shifts and resilience under differential privacy, highlighting the practical value of shared non-local prompts for VLMs in privacy-preserving collaborative learning.
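The KNN-based selection mentioned above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: prompts are treated as flat vectors, the distance is Euclidean, and the names `select_nonlocal_experts`, `pool`, and `own_id` are hypothetical. The summary does not specify the metric or the pool layout.

```python
import math


def select_nonlocal_experts(local_prompt, pool, own_id, k):
    """Return the ids of the k pool prompts nearest to local_prompt.

    Assumptions (not from the paper summary): prompts are flat lists of
    floats, distance is Euclidean, and the client's own entry is skipped.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    candidates = [
        (dist(local_prompt, prompt), client_id)
        for client_id, prompt in pool.items()
        if client_id != own_id
    ]
    candidates.sort()  # nearest first; ties broken by client id
    return [client_id for _, client_id in candidates[:k]]


# Toy server pool with one flattened prompt per client (hypothetical data).
pool = {
    "client_a": [0.0, 0.0],
    "client_b": [1.0, 0.0],
    "client_c": [5.0, 5.0],
    "client_d": [0.5, 0.5],
}
selected = select_nonlocal_experts(pool["client_a"], pool, "client_a", k=2)
print(selected)
```

Because `k` is a plain argument, the same routine works for any number of experts, which mirrors the flexibility the summary attributes to the gating network.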

Abstract

Federated prompt learning brings the robust representation learning ability of CLIP-like Vision-Language Models (VLMs) to federated learning through prompt learning. However, current federated prompt learning methods are habitually restricted to the traditional FL paradigm, where participating clients are generally allowed to download only a single globally aggregated model from the server. While this paradigm is justifiable for training full-sized models under federated settings, we argue in this work that it is ill-suited for lightweight prompts. By allowing clients to download multiple pre-aggregated prompts as fixed non-local experts, we propose Personalized Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that personalizes the prompt learning process through the lens of Mixture of Experts (MoE). pFedMoAP implements a local attention-based gating network that learns to generate enhanced text features for better alignment with local image data, benefiting from both local and downloaded non-local adaptive prompt experts. Extensive experiments on 9 datasets under various federated settings demonstrate the efficacy of the proposed pFedMoAP algorithm. The code is available at https://github.com/ljaiverson/pFedMoAP.

Paper Structure

This paper contains 20 sections, 11 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Schematic diagram of employing Mixture of Experts into federated learning. We facilitate the sharing of pre-aggregated prompts thanks to their lightweight nature. Each client downloads the pre-aggregated prompts trained on the remaining two clients through the server, keeping them fixed locally as non-local experts.
  • Figure 2: Workflow of pFedMoAP at client $i$. The client first computes the non-local text features using the non-local prompt experts. As training progresses, it then calculates the local text features. Taking class 3 as an example, both local and non-local text features are input into the attention-based gating network as both key and value, while image features serve as the query. This process generates enhanced text features. Matching scores are derived from two sources: local text features and MoE-enhanced text features. These scores are then combined through weighted averaging to produce the final logits.
  • Figure 3: Ablation study on the number of shots.
  • Figure 4: Ablation study on the coefficient for the logits from local prompt, $\lambda$.
  • Figure 5: The impact of the number of experts on CIFAR10 with 100 clients.
  • ...and 1 more figures
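
The workflow in the Figure 2 caption, together with the coefficient $\lambda$ from Figure 4, can be sketched for a single class as follows. This is a minimal illustration rather than the paper's code. It assumes scaled dot-product attention with one head, cosine similarity as the matching score, and a convex combination `lam * local + (1 - lam) * moe` for the weighted averaging; the caption confirms none of these specifics. The learned projections of the gating network (including the reduction to $d_{gating}=128$) are omitted, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhanced_text_feature(image_feat, expert_text_feats):
    """Image feature is the query; expert text features are keys and values."""
    scale = math.sqrt(len(image_feat))
    weights = softmax([dot(image_feat, t) / scale for t in expert_text_feats])
    fused = [
        sum(w * t[j] for w, t in zip(weights, expert_text_feats))
        for j in range(len(image_feat))
    ]
    return fused, weights


def combined_logit(image_feat, local_text_feat, nonlocal_text_feats, lam):
    """Weighted average of the local score and the MoE-enhanced score.

    The convex form used here is an assumption made for illustration.
    """
    experts = [local_text_feat] + nonlocal_text_feats
    fused, _ = enhanced_text_feature(image_feat, experts)
    local_score = cosine(image_feat, local_text_feat)
    moe_score = cosine(image_feat, fused)
    return lam * local_score + (1.0 - lam) * moe_score


# Toy 2-d features for one class (hypothetical values).
image = [1.0, 0.0]
local_text = [1.0, 0.0]
nonlocal_texts = [[0.0, 1.0], [1.0, 1.0]]
print(combined_logit(image, local_text, nonlocal_texts, lam=0.5))
```

In this sketch only the local text feature would depend on trainable prompt parameters; the non-local features come from fixed experts, so they could be computed once per round, as the caption's ordering suggests.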