CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model

Shihao Hou, Xinyi Shang, Shreyank N Gowda, Yang Lu, Chao Wu, Yan Yan, Hanzi Wang

TL;DR

This work addresses federated learning under the dual challenges of non-IID data and long-tailed class distributions by introducing CAPT, a framework that leverages a pre-trained vision-language model through a dual-prompt design and heterogeneity-aware client clustering. The method combines a general prompt for domain-invariant representation with class-aware prompts for per-class discrimination, all aligned via a vision-language mapping and a joint contrastive objective. A two-pronged clustering strategy groups clients by distribution similarity and complementary head-tail coverage, complemented by a Multi-Armed Bandit scheduler for communication efficiency. Comprehensive experiments across CIFAR-10-LT, CIFAR-100-LT, Fashion-MNIST-LT, and ImageNet-LT demonstrate CAPT’s ability to substantially improve tail-class performance while maintaining competitive overall accuracy, outperforming state-of-the-art prompt-tuning and FL methods. The work provides theoretical insight into why traditional prompt tuning struggles in federated long-tailed settings and offers a practical, scalable solution with broad implications for robust, fair, and efficient federated learning using vision-language models.
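The TL;DR mentions a Multi-Armed Bandit scheduler but does not say which bandit algorithm or reward signal it uses. The sketch below assumes UCB1 over client clusters, with a hypothetical reward in [0, 1] such as a normalized accuracy gain after aggregating a cluster. `UCB1Scheduler`, the reward definition, and the simulated per-cluster reward probabilities are all illustrative assumptions, not CAPT's scheduler.

```python
import math
import random


class UCB1Scheduler:
    """UCB1 bandit whose arms are client clusters (hypothetical instantiation).

    The reward is an assumed scalar in [0, 1], e.g. a normalized
    validation-accuracy gain after aggregating the chosen cluster.
    """

    def __init__(self, n_arms: int, c: float = 2.0):
        self.c = c                   # exploration strength; c = 2 gives classic UCB1
        self.counts = [0] * n_arms   # times each arm has been scheduled
        self.means = [0.0] * n_arms  # running mean reward per arm
        self.t = 0                   # total scheduling rounds so far

    def select(self) -> int:
        self.t += 1
        # Schedule every arm once before relying on confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            m + math.sqrt(self.c * math.log(self.t) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Simulated usage: three clusters with hypothetical Bernoulli reward rates.
random.seed(0)
true_p = [0.2, 0.5, 0.8]
sched = UCB1Scheduler(n_arms=3)
for _ in range(2000):
    arm = sched.select()
    sched.update(arm, 1.0 if random.random() < true_p[arm] else 0.0)
print(sched.counts)  # the highest-reward cluster should dominate the schedule
```

UCB1 is used here only because it needs no tuning beyond the exploration constant; the paper's scheduler may use a different bandit variant or reward.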

Abstract

Effectively handling the co-occurrence of non-IID data and long-tailed distributions remains a critical challenge in federated learning. While fine-tuning vision-language models (VLMs) such as CLIP has been shown to be promising for addressing non-IID data, this approach leads to severe degradation on tail classes in federated long-tailed scenarios. Under the combined effects of strongly non-IID data distributions and long-tailed class imbalance, VLM fine-tuning may even fail to yield any improvement. To address this issue, we propose Class-Aware Prompt Tuning for Federated Long-tailed Learning (CAPT), a novel framework that leverages a pre-trained VLM to handle both data heterogeneity and long-tailed distributions. CAPT introduces a dual-prompt mechanism that combines general and class-aware prompts, enabling the framework to capture global trends while preserving class-specific knowledge. To better aggregate and share knowledge across clients, we introduce a heterogeneity-aware client clustering strategy that groups clients based on their data distributions, enabling efficient collaboration and knowledge sharing. Extensive experiments on various long-tailed datasets with different levels of data heterogeneity demonstrate that CAPT significantly improves tail-class performance without compromising overall accuracy, outperforming state-of-the-art methods in federated long-tailed learning scenarios.
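The abstract says clients are grouped by their data distributions. One simple way to realize the distribution-similarity part of that idea, assumed here purely for illustration, is to compare per-client label histograms with the Jensen-Shannon divergence and merge clients whose divergence falls below a threshold `tau` (single linkage). The complementary head-tail coverage criterion mentioned in the TL;DR is not modeled, and `js_divergence`, `cluster_clients`, and `tau` are hypothetical names.

```python
import numpy as np


def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (natural log) between two label distributions."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_clients(label_counts: np.ndarray, tau: float) -> list[list[int]]:
    """Single-linkage grouping of clients whose label distributions are closer
    than tau in JS divergence. label_counts has shape (n_clients, n_classes)."""
    dists = label_counts / label_counts.sum(axis=1, keepdims=True)
    n = len(dists)
    parent = list(range(n))  # union-find forest over clients

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if js_divergence(dists[i], dists[j]) < tau:
                parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy example: clients 0-1 are dominated by class 0, clients 2-3 by class 2.
counts = np.array([[90, 8, 2], [85, 10, 5], [3, 7, 90], [5, 10, 85]])
print(cluster_clients(counts, tau=0.1))
```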

Paper Structure

This paper contains 44 sections, 7 theorems, 39 equations, 7 figures, 9 tables, 1 algorithm.

Key Result

Lemma 1

The variance of the global gradient estimator can be decomposed into the sum of the within-client variance and the between-client variance.
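A standard form of this decomposition is the law of total variance. Assuming the global estimator first picks client $k$ with probability $p_k = n_k/n$ and then draws one of that client's per-sample gradients $g$, with client mean $\mu_k$ and global mean $\mu$: $\mathbb{E}\|g-\mu\|^2 = \sum_k p_k\,\mathbb{E}_k\|g-\mu_k\|^2 + \sum_k p_k\|\mu_k-\mu\|^2$. The snippet below checks this identity numerically on synthetic gradients; the sampling model, client sizes, and distributions are assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-sample gradients (dimension 4) for three clients whose means
# differ, mimicking non-IID data. Sizes and distributions are arbitrary.
client_grads = [
    rng.normal(loc=mean, scale=std, size=(n, 4))
    for mean, std, n in [(0.0, 1.0, 500), (2.0, 0.5, 200), (-1.0, 2.0, 300)]
]


def mean_sq_dev(x: np.ndarray, center: np.ndarray) -> float:
    """Average squared Euclidean distance of the rows of x from center."""
    return float(np.mean(np.sum((x - center) ** 2, axis=1)))


all_grads = np.vstack(client_grads)
mu = all_grads.mean(axis=0)                           # global mean gradient
p = [len(g) / len(all_grads) for g in client_grads]   # p_k = n_k / n

total = mean_sq_dev(all_grads, mu)
within = sum(pk * mean_sq_dev(g, g.mean(axis=0)) for pk, g in zip(p, client_grads))
between = sum(pk * float(np.sum((g.mean(axis=0) - mu) ** 2)) for pk, g in zip(p, client_grads))

print(f"total={total:.4f}  within+between={within + between:.4f}")
```

The between-client term is the part that grows as client data becomes more heterogeneous, which is why it is singled out in the lemma's name.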

Figures (7)

  • Figure 1: Performance comparison between PromptFL (upper part) and our proposed CAPT (lower part) on CIFAR-100-LT and ImageNet-LT. As client heterogeneity increases ($\alpha$ decreases from 0.5 to 0.05), PromptFL exhibits a widening performance gap between head and tail classes, with overall accuracy even falling below the zero-shot CLIP baseline (dashed line). In contrast, CAPT effectively mitigates the impact of client heterogeneity and significantly reduces the head-tail performance disparity while maintaining superior overall accuracy across all settings.
  • Figure 2: An overview of our proposed CAPT framework.
  • Figure 3: Accuracy gains (%) of PromptFL and CAPT relative to CLIP. CAPT achieves superior performance on head classes, tail classes, and overall, with particularly significant improvements on tail classes.
  • Figure 4: Training dynamics comparison between PromptFL and CAPT on ImageNet-LT dataset.
  • Figure 5: t-SNE visualization of General Prompt and Class-Aware Prompt embeddings. The shaded area illustrates the region where General Prompt features are clustered.
  • ...and 2 more figures
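Figure 1 varies client heterogeneity through $\alpha$. In many federated benchmarks $\alpha$ is the concentration parameter of a Dirichlet distribution used to split each class across clients; this page does not state CAPT's exact partitioning protocol, so the sketch below assumes that common scheme. `dirichlet_partition` and `mean_max_share` are hypothetical helpers, and the toy labels are balanced only to isolate the effect of $\alpha$.

```python
import numpy as np


def dirichlet_partition(labels: np.ndarray, n_clients: int, alpha: float,
                        seed: int = 0) -> list[np.ndarray]:
    """Assign every sample index to a client. For each class, client shares are
    drawn from Dir(alpha, ..., alpha); smaller alpha gives more skewed clients."""
    rng = np.random.default_rng(seed)
    client_idx: list[list[int]] = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        shares = rng.dirichlet(np.full(n_clients, alpha))
        # Convert cumulative shares into split points within this class.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part.tolist())
    return [np.array(ix, dtype=int) for ix in client_idx]


def mean_max_share(parts: list[np.ndarray], labels: np.ndarray, n_classes: int) -> float:
    """Average, over non-empty clients, of the fraction held by the client's largest class."""
    shares = [np.bincount(labels[p], minlength=n_classes).max() / len(p)
              for p in parts if len(p) > 0]
    return float(np.mean(shares))


labels = np.repeat(np.arange(10), 100)  # toy labels: 10 classes x 100 samples
for alpha in (0.5, 0.05):
    parts = dirichlet_partition(labels, n_clients=10, alpha=alpha)
    print(alpha, round(mean_max_share(parts, labels, 10), 3))
```

A higher mean max share at $\alpha = 0.05$ indicates that each client's data is concentrated on fewer classes, which is the regime where Figure 1 reports PromptFL's widening head-tail gap.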

Theorems & Definitions (16)

  • Definition 1: Imbalance Ratio
  • Definition 2: Long-Tailed Distribution
  • Lemma 1: Gradient Variance Decomposition
  • Proof
  • Lemma 2: Impact of Class Imbalance on Gradient Discrepancy
  • Proof
  • Theorem 1: Convergence Difficulty in Traditional Prompt Tuning
  • Proof
  • Lemma 3: Distribution Discrepancy Under Long-Tailed Setting
  • Proof
  • ...and 6 more
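Definitions 1 and 2 are listed without their statements. A widely used convention, assumed here rather than confirmed by this page, defines the imbalance ratio as the largest class size divided by the smallest, and builds long-tailed benchmarks such as CIFAR-10-LT with an exponentially decaying per-class count $n_c = n_{\max}\,\mathrm{IR}^{-c/(C-1)}$ for $c = 0, \dots, C-1$. The helper names and the values 5000 and 100 below are illustrative.

```python
def long_tailed_counts(n_max: int, n_classes: int, ir: float) -> list[int]:
    """Per-class sample counts decaying exponentially from n_max to n_max / ir.
    Uses n_c = n_max * ir ** (-c / (C - 1)) for c = 0..C-1 (requires C >= 2)."""
    return [int(n_max * ir ** (-c / (n_classes - 1))) for c in range(n_classes)]


def imbalance_ratio(counts: list[int]) -> float:
    """Imbalance ratio: size of the largest class divided by the smallest."""
    return max(counts) / min(counts)


counts = long_tailed_counts(n_max=5000, n_classes=10, ir=100)
print(counts)                   # head class first, tail class last
print(imbalance_ratio(counts))  # close to the requested ratio of 100
```

Truncating with `int` can make the realized ratio differ slightly from the requested one, so computing it from the actual counts, as `imbalance_ratio` does, is the safer reading of Definition 1.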