Global and Local Prompts Cooperation via Optimal Transport for Federated Learning
Hongxia Li, Wei Huang, Jingya Wang, Ye Shi
TL;DR
This work tackles data heterogeneity in federated learning for vision-language models by introducing FedOTP, which jointly learns a global prompt for cross-client consensus and a personalized local prompt for client-specific traits within a CLIP-based framework. It uses unbalanced Optimal Transport to align local visual features with both prompts, which lets the prompts focus selectively on the most relevant image patches, and it adopts a fast Dykstra-based solver for efficiency. The authors provide a generalization bound under Lipschitz assumptions and report that FedOTP outperforms state-of-the-art prompt-based and traditional personalized federated learning (PFL) methods across diverse label-shift and feature-shift scenarios, with qualitative visualizations showing distinct roles for the global and local prompts. The approach reduces communication cost and preserves personalization, giving robust performance in highly heterogeneous settings, with practical implications for scalable, privacy-preserving deployment of vision-language models.
Abstract
Prompt learning in pretrained visual-language models has shown remarkable flexibility across various downstream tasks. Leveraging its inherent lightweight nature, recent research has attempted to integrate these powerful pretrained models into federated learning frameworks to simultaneously reduce communication costs and promote local training on insufficient data. Despite these efforts, current federated prompt learning methods lack specialized designs to systematically address severe data heterogeneity, e.g., data distributions involving both label and feature shifts. To address this challenge, we present Federated Prompts Cooperation via Optimal Transport (FedOTP), which introduces efficient collaborative prompt learning strategies to capture diverse category traits on a per-client basis. Specifically, for each client, we learn a global prompt to extract consensus knowledge among clients, and a local prompt to capture client-specific category characteristics. Unbalanced Optimal Transport is then employed to align local visual features with these prompts, striking a balance between global consensus and local personalization. By relaxing one of the equality constraints, FedOTP enables prompts to focus solely on the core regions of image patches. Extensive experiments on datasets with various types of heterogeneity demonstrate that FedOTP outperforms state-of-the-art methods.
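To make the idea of "relaxing one of the equality constraints" concrete, the following is a minimal sketch, not the authors' implementation. It assumes an entropic-regularized transport problem between M image-patch features and N prompt features (here N = 2, for the global and local prompts), in which the patch-side marginal is relaxed to an upper bound while the prompt-side marginal stays an equality. The function name `unbalanced_ot_plan`, the marginals `a` and `b`, the regularization strength `eps`, and the toy cost matrix are all hypothetical choices for illustration. The solver is a generalized Sinkhorn-style scaling iteration, used here as a stand-in; it is not claimed to match the Dykstra-based solver used in the paper.

```python
import numpy as np


def unbalanced_ot_plan(cost, a, b, eps=0.1, n_iters=500):
    """Entropic OT plan with row sums <= a (relaxed) and column sums == b (kept).

    cost: (M, N) array, e.g. a distance between M patch features and N prompts.
    a:    (M,) upper bound on the mass each patch may send (assumed marginal).
    b:    (N,) mass each prompt must receive; requires b.sum() <= a.sum().
    """
    cost = np.asarray(cost, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if b.sum() > a.sum() + 1e-12:
        raise ValueError("infeasible: b.sum() must not exceed a.sum()")

    K = np.exp(-cost / eps)  # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                 # rescale so column sums equal b
        u = np.minimum(a / (K @ v), 1.0)  # row sums become min(a, K @ v)
    return u[:, None] * K * v[None, :]


if __name__ == "__main__":
    # Hypothetical toy costs: 4 patches (rows) against 2 prompts (columns).
    # Patches 0 and 2 are close to the prompts, patches 1 and 3 are far.
    cost = np.array([[0.1, 0.2], [0.9, 0.8], [0.3, 0.1], [0.7, 0.9]])
    a = np.full(4, 0.25)  # each patch may send at most 1/4 of the mass
    b = np.full(2, 0.4)   # the prompts together receive 0.8 of the mass
    plan = unbalanced_ot_plan(cost, a, b, eps=0.5)
    print("mass sent per patch:", plan.sum(axis=1))
    print("mass received per prompt:", plan.sum(axis=0))
```

Because the patch-side constraint is only an upper bound, patches with a high cost to both prompts are free to send less than their full share of mass. In this sketch, that is the mechanism by which the transport plan concentrates on the image regions that match the prompts best.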
