
FAA-CLIP: Federated Adversarial Adaptation of CLIP

Yihang Wu, Ahmad Chaddad, Christian Desrosiers, Tareef Daqqaq, Reem Kateb

TL;DR

FAA-CLIP tackles the deployment of vision-language models in federated learning (FL) by freezing the CLIP backbone and training a light-weight feature adaptation module (FAM) on each client, while a domain adaptation module mitigates distribution shifts between clients. The global server aggregates only the FAM parameters, greatly reducing communication and computation costs. Across six natural and medical imaging datasets, FAA-CLIP consistently outperforms strong FL baselines and improves calibration and ROC-AUC on the medical tasks. This work demonstrates that domain-adversarial adaptation paired with compact adapters can unlock CLIP's potential for privacy-preserving, heterogeneous healthcare and vision tasks; the code is publicly available.
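
The server-side step described above can be sketched in plain Python. This is a minimal sketch, not the authors' implementation. It assumes FedAvg-style weighting by each client's number of training samples, since the summary only says the server "aggregates" the FAM parameters. Each FAM is represented as a hypothetical dictionary of flat parameter lists keyed by name (e.g. `att`, echoing the $att_i$ notation of Figure 1). The frozen CLIP weights never appear in the payload, which is the source of the communication saving.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of floats.
FamParams = Dict[str, List[float]]


def aggregate_fam(client_params: List[FamParams], client_sizes: List[int]) -> FamParams:
    """FedAvg-style weighted average of the clients' FAM parameters only.

    Assumption: weights are proportional to each client's sample count;
    the frozen CLIP backbone is never transmitted or averaged.
    """
    total = sum(client_sizes)
    global_params: FamParams = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        global_params[name] = [
            sum(params[name][i] * n for params, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return global_params


def payload_size(params: FamParams) -> int:
    """Number of scalars a client uploads per round (FAM parameters only)."""
    return sum(len(values) for values in params.values())


# Two hypothetical clients with a 2-parameter FAM and 1 vs. 3 local samples.
clients = [{"att": [0.0, 4.0]}, {"att": [4.0, 0.0]}]
global_fam = aggregate_fam(clients, client_sizes=[1, 3])
print(global_fam)                # {'att': [3.0, 1.0]}
print(payload_size(global_fam))  # 2
```

In a full round, the resulting global FAM ($att^{*}$) would be sent back and loaded by every client before its next local update.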

Abstract

Despite the remarkable performance of vision-language models (VLMs) such as Contrastive Language-Image Pre-training (CLIP), the large size of these models is a considerable obstacle to their use in federated learning (FL) systems, where the parameters of local client models must be transferred to a global server for aggregation. Another challenge in FL is the heterogeneity of data across clients, which affects the generalization performance of the solution. In addition, VLMs pre-trained on natural images generalize poorly to medical datasets, suggesting that a domain gap exists. To address these issues, we introduce a novel method for the Federated Adversarial Adaptation (FAA) of CLIP. Our method, named FAA-CLIP, handles the large communication costs of CLIP by using a light-weight feature adaptation module (FAM) for aggregation, effectively adapting this VLM to each client's data while greatly reducing the number of parameters to transfer. By keeping CLIP frozen and only updating the FAM parameters, our method is also computationally efficient. Unlike existing approaches, FAA-CLIP directly addresses domain shifts across clients via a domain adaptation (DA) module. This module employs a domain classifier to predict whether a given sample comes from the local client or the global server, allowing the model to learn domain-invariant representations. Extensive experiments on six datasets containing both natural and medical images demonstrate that FAA-CLIP generalizes well on both natural and medical datasets compared to recent FL approaches. Our code is available at https://github.com/AIPMLab/FAA-CLIP.
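
The abstract does not give the exact form of the DA loss. As an illustration only, the sketch below swaps in the classic domain-adversarial recipe with a gradient reversal layer (as in DANN), which may differ from FAA-CLIP's actual formulation. A logistic-regression domain classifier (`v`, `b`) is trained by ordinary gradient descent to tell local samples (`d = 1`) from global ones (`d = 0`). The FAM, modeled here as a hypothetical element-wise scaling `w` of a frozen CLIP feature `x`, receives the reversed gradient scaled by `lam`, so it is pushed toward features that confuse the classifier. The contrastive task loss is omitted, and every name here is hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def adapt(w: List[float], x: List[float]) -> List[float]:
    """Hypothetical FAM: element-wise scaling of a frozen CLIP feature x."""
    return [wj * xj for wj, xj in zip(w, x)]


def domain_loss(w: List[float], v: List[float], b: float, x: List[float], d: int) -> float:
    """Binary cross-entropy of the domain classifier on the adapted feature."""
    p = sigmoid(sum(vj * zj for vj, zj in zip(v, adapt(w, x))) + b)
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))


def adversarial_step(
    w: List[float], v: List[float], b: float, x: List[float], d: int,
    lr: float = 0.1, lam: float = 1.0,
) -> Tuple[List[float], List[float], float]:
    """One gradient-reversal update of the FAM (w) and domain classifier (v, b)."""
    z = adapt(w, x)
    p = sigmoid(sum(vj * zj for vj, zj in zip(v, z)) + b)
    g = p - d  # dL/dlogit for a sigmoid output with binary cross-entropy
    # Domain classifier: ordinary descent, so it gets better at telling domains apart.
    new_v = [vj - lr * g * zj for vj, zj in zip(v, z)]
    new_b = b - lr * g
    # FAM: the reversal layer multiplies dL/dw_j = g * v_j * x_j by -lam,
    # turning the descent step into an ascent step on the domain loss.
    new_w = [wj + lr * lam * g * vj * xj for wj, vj, xj in zip(w, v, x)]
    return new_w, new_v, new_b


x, d = [1.0, 2.0], 1  # a hypothetical feature from the local client
w, v, b = [1.0, 1.0], [0.5, -0.25], 0.0
new_w, new_v, new_b = adversarial_step(w, v, b, x, d)
print(domain_loss(w, v, b, x, d))          # loss before the update
print(domain_loss(w, new_v, new_b, x, d))  # classifier step alone lowers it
print(domain_loss(new_w, v, b, x, d))      # FAM step alone raises it
```

The two updates pull in opposite directions by design: at equilibrium the classifier cannot separate local from global features, which is what "domain-invariant representations" means in this adversarial setting.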


Paper Structure

This paper contains 12 sections, 7 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: FAA-CLIP pipeline. Each client trains its local model separately, optimizing only the parameters of its local feature adaptation module (FAM) ($att_i$) and domain classifier $D_i$ using contrastive and domain adaptation losses. After receiving the local client parameters, the server aggregates them into a global FAM ($att^{*}$), whose parameters are transmitted back to the clients.
  • Figure 2: Example of the data distribution across clients, shown as a kite graph for the Multi-OF dataset. $C_1$ to $C_{15}$ denote the clients, while the labels C_1 to C_65 denote the classes.
  • Figure 3: Testing accuracy (ACC), balanced accuracy (BACC) and macro-F1 at each communication epoch on the skin cancer (SC) dataset for FAA-CLIP, FedProx, FedAVG, FedCLIP, LoRA$_{r=3}$, PromptFL and MOON.
  • Figure 4: Testing accuracy of the baselines and FAA-CLIP on the BT, SC and HK datasets.
  • Figure 5: ROC curves (left) and decision curve analysis (DCA, right) of FAA-CLIP, FedCLIP, FedAVG, FedProx and MOON on the skin cancer (SC) dataset. ROC curves measure classifier performance, while DCA assesses the net benefit for clinical practice.
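
Figures 3 and 5 report balanced accuracy (BACC), macro-F1 and decision curve analysis (DCA). The standard definitions can be sketched in plain Python as below; this illustrates the textbook formulas and is not the authors' evaluation code, which may rely on a library. BACC is the mean per-class recall, macro-F1 is the unweighted mean of per-class F1 scores, and the DCA net benefit at threshold probability $p_t$ is $\mathrm{TP}/n - (\mathrm{FP}/n)\,p_t/(1-p_t)$. The labels and scores in the usage lines are made up.

```python
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls: List[float] = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    f1s: List[float] = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def net_benefit(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """DCA net benefit of treating every case whose score >= threshold (binary labels)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


print(round(balanced_accuracy([0, 0, 0, 1], [0, 0, 1, 1]), 4))  # → 0.8333
print(round(macro_f1([0, 0, 0, 1], [0, 0, 1, 1]), 4))           # → 0.7333
print(round(net_benefit([1, 0, 1, 0], [0.9, 0.6, 0.4, 0.2], 0.25), 4))  # → 0.4167
```

Sweeping `threshold` over a grid and plotting `net_benefit` for each model gives decision curves of the kind shown in Figure 5.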