Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification
Yihang Wu, Ahmad Chaddad
TL;DR
The paper tackles privacy-preserving medical image classification when data are distributed across hospitals, addressing the data heterogeneity and high resource costs that come with federated training of vision-language models. It proposes FedMedCLIP, a CLIP-based federated framework that freezes the CLIP encoders, communicates only a masked feature adaptation module (FAM), and keeps a private masked MLP on each client, with class-wise KL distillation between the two to balance personalization and generalization. An ensemble of FAM and MLP predictions, combined with model compression, yields accuracy gains (e.g., ~8% on ISIC2019 over the next-best baseline) at substantially lower training and transmission cost (≈120× faster than FedAVG). Results on four public medical datasets suggest the approach is practical for resource-constrained, privacy-sensitive clinical deployments.
Abstract
Despite the remarkable performance of deep models in medical imaging, they still require access to source data for training, which limits their potential in light of privacy concerns. Federated learning (FL), a decentralized learning framework that trains a shared model across multiple hospitals (a.k.a. FL clients), provides a feasible solution. However, data heterogeneity and resource costs hinder the deployment of FL models, especially when using vision-language models (VLMs). To address these challenges, we propose FedMedCLIP, a novel FL approach for medical image classification based on contrastive language-image pre-training (CLIP). Specifically, we introduce a masked feature adaptation module (FAM) as the communication module to reduce the communication load, while freezing the CLIP encoders to reduce the computational overhead. Furthermore, we propose a masked multi-layer perceptron (MLP) as a private local classifier that adapts to each client's task. Moreover, we design an adaptive Kullback-Leibler (KL) divergence-based distillation regularization method to enable mutual learning between the FAM and the MLP. Finally, we compress the FAM parameters before transmission and classify with an ensemble of the FAM and MLP predictions. Extensive experiments on four publicly available medical datasets demonstrate that our model achieves competitive performance (e.g., 8\% higher than the second-best baseline on ISIC2019) at a reasonable resource cost (e.g., 120$\times$ faster than FedAVG).
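The abstract names three mechanisms without giving formulas: KL-based mutual learning between the FAM and the MLP, ensemble prediction, and compressed transmission of the FAM. The sketch below illustrates one plausible reading of each in plain Python. It is not the authors' implementation: the symmetric KL form, the mixing weight `alpha`, the magnitude-based top-k mask, and all function names are assumptions made here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_distillation_loss(fam_logits, mlp_logits, weight=1.0):
    """Symmetric KL term pulling each head toward the other.

    Assumption: the paper's adaptive, class-wise weighting is replaced
    here by a single scalar `weight`.
    """
    p_fam = softmax(fam_logits)
    p_mlp = softmax(mlp_logits)
    return weight * (kl_divergence(p_fam, p_mlp) + kl_divergence(p_mlp, p_fam))

def ensemble_predict(fam_logits, mlp_logits, alpha=0.5):
    """Mix the two heads' probabilities; `alpha` is a hypothetical weight."""
    p_fam = softmax(fam_logits)
    p_mlp = softmax(mlp_logits)
    probs = [alpha * a + (1 - alpha) * b for a, b in zip(p_fam, p_mlp)]
    return probs.index(max(probs)), probs

def mask_for_upload(params, keep_ratio=0.25):
    """Keep the largest-magnitude entries as a sparse {index: value} dict.

    Assumption: magnitude-based top-k selection stands in for whatever
    masking and compression scheme the paper actually uses.
    """
    k = max(1, int(len(params) * keep_ratio))
    order = sorted(range(len(params)), key=lambda i: abs(params[i]), reverse=True)
    return {i: params[i] for i in sorted(order[:k])}

if __name__ == "__main__":
    fam, mlp = [2.0, 0.1, 0.1], [1.5, 0.2, 0.3]
    print("distillation loss:", mutual_distillation_loss(fam, mlp))
    print("ensemble label:", ensemble_predict(fam, mlp)[0])
    print("sparse upload:", mask_for_upload([0.1, -2.0, 0.05, 1.5], keep_ratio=0.5))
```

In this reading, only the sparse dictionary returned by `mask_for_upload` would leave the client, while the MLP and the frozen CLIP encoders stay local. The distillation term is zero when the two heads agree exactly and grows as their predictions diverge.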
