Probing the Efficacy of Federated Parameter-Efficient Fine-Tuning of Vision Transformers for Medical Image Classification

Naif Alkhunaizi, Faris Almalik, Rouqaiah Al-Refai, Muzammal Naseer, Karthik Nandakumar

TL;DR

The paper investigates federated parameter-efficient fine-tuning (PEFT) for Vision Transformers in medical image classification, addressing data scarcity, privacy, and communication constraints across institutions. It systematically evaluates multiple federated PEFT strategies, including Visual Prompt Tuning (VPT), low-rank adaptation (LoRA), decomposed prompts (DVPT), and stochastic block attention (SBA), as well as hybrid combinations, under IID and non-IID data distributions and under in-domain and out-of-domain (OOD) transfer. The findings show a clear trade-off: while many methods dramatically reduce the number of exchanged parameters, accuracy can degrade, especially for OOD data and non-IID client distributions, with about a 4% accuracy drop per order-of-magnitude reduction in parameters in OOD scenarios. The work emphasizes the importance of starting from in-domain medical foundation models when possible and highlights the relative robustness of visual prompts over textual prompts for medical imaging tasks, informing practical deployment of federated PEFT in healthcare.

Abstract

With the advent of large pre-trained transformer models, fine-tuning these models for various downstream tasks is a critical problem. Paucity of training data, the existence of data silos, and stringent privacy constraints exacerbate this fine-tuning problem in the medical imaging domain, creating a strong need for algorithms that enable collaborative fine-tuning of pre-trained models. Moreover, the large size of these models necessitates the use of parameter-efficient fine-tuning (PEFT) to reduce the communication burden in federated learning. In this work, we systematically investigate various federated PEFT strategies for adapting a Vision Transformer (ViT) model (pre-trained on a large natural image dataset) for medical image classification. Apart from evaluating known PEFT techniques, we introduce new federated variants of PEFT algorithms such as visual prompt tuning (VPT), low-rank decomposition of visual prompts, stochastic block attention fine-tuning, and hybrid PEFT methods like low-rank adaptation (LoRA)+VPT. Moreover, we perform a thorough empirical analysis to identify the optimal PEFT method for the federated setting and understand the impact of data distribution on federated PEFT, especially for out-of-domain (OOD) and non-IID data. The key insight of this study is that while most federated PEFT methods work well for in-domain transfer, there is a substantial accuracy vs. efficiency trade-off when dealing with OOD and non-IID scenarios, which is commonly the case in medical imaging. Specifically, every order of magnitude reduction in fine-tuned/exchanged parameters can lead to a 4% drop in accuracy. Thus, the initial model choice is crucial for federated PEFT. It is preferable to use medical foundation models learned from in-domain medical image data (if available) rather than general vision models.
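
The trade-off stated in the abstract can be turned into a rough back-of-the-envelope estimate. The sketch below assumes the reported "4% per order of magnitude" figure extrapolates log-linearly, which the abstract does not claim; the function name `estimated_accuracy_drop` and the parameter counts in the usage line are hypothetical and chosen only for illustration.

```python
import math


def estimated_accuracy_drop(full_params: int, peft_params: int,
                            drop_per_decade: float = 4.0) -> float:
    """Rough accuracy drop (in %) when shrinking the exchanged parameters.

    Assumption (not from the paper): the drop grows linearly with the
    number of orders of magnitude removed, at `drop_per_decade` per decade.
    """
    if full_params <= 0 or peft_params <= 0:
        raise ValueError("parameter counts must be positive")
    # Number of orders of magnitude between the two parameter budgets.
    decades = math.log10(full_params / peft_params)
    # No reduction (or an increase) is treated as no drop.
    return max(0.0, decades) * drop_per_decade


# Hypothetical budgets: 1,000,000 exchanged parameters reduced to 10,000
# is two orders of magnitude, so the estimate is about 8.
drop = estimated_accuracy_drop(1_000_000, 10_000)
```

This is only a reading aid for the OOD and non-IID setting described above; for in-domain transfer the abstract reports that most federated PEFT methods work well, so the estimate should not be applied there.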

Paper Structure (6 sections, 4 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Adaptation of Vision Transformer (ViT) model using federated PEFT methods. Only the parameters marked as trainable are exchanged between the clients and the server, while the frozen parameters are not communicated.
  • Figure 2: (a) Accuracy vs. efficiency trade-off of various federated PEFT methods (Full Fine-tuning to LoRA, as listed in the table of method parameter counts). The trade-off is more pronounced for OOD transfer (Fed-ISIC2019) than for in-domain transfer (Caltech101). (b) Accuracy of federated PEFT methods on Fed-ISIC2019 with only $5$ clients (excluding client $4$), when (Left) the base model is first fine-tuned with in-domain data (client $4$ data) and (Right) the base model is pre-trained using natural images. The in-domain base model shows less performance variability.
  • Figure 3: From left to right, distribution of the HAM10000 (IID), Fed-ISIC2019 (Non-IID), Flowers102 (IID), and Caltech101 (IID) datasets. Each stacked bar represents the number of training samples, and each color represents a class. Fed-ISIC2019 [terrail2022flamby] contains $23,247$ samples across eight melanoma classes. HAM10000 [HAM1000] comprises $10,015$ dermoscopic images categorized into $7$ lesion types. We employ an $80\%~(20\%)$ train (test) split for both of these datasets. Caltech101 [caltech101] has $101$ categories of natural images with a $50\%~(50\%)$ train (test) split. Flowers102 [flowers102] includes $102$ categories with a $25\%~(75\%)$ train (test) split.
  • Figure 4: Balanced accuracy for different numbers of prompts with the VPT method on the Fed-ISIC2019 dataset. Optimal performance was achieved with $R = 50$ prompts.
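
The Figure 1 caption says that only trainable parameters travel between clients and server while frozen ones stay local. A minimal sketch of that exchange is given below. It assumes sample-size-weighted averaging (FedAvg) as the aggregation rule, which this excerpt does not specify, and the parameter names (`backbone`, `prompts`, `head`) and helper names are hypothetical.

```python
from typing import Dict, List, Set

Params = Dict[str, List[float]]


def trainable_subset(params: Params, trainable_names: Set[str]) -> Params:
    """Keep only the parameters a client would send to the server."""
    return {name: values for name, values in params.items()
            if name in trainable_names}


def fed_avg(client_updates: List[Params], client_sizes: List[int]) -> Params:
    """Average client updates, weighting each client by its sample count.

    Assumption: plain FedAvg weighting; the source may aggregate differently.
    """
    total = sum(client_sizes)
    aggregated: Params = {}
    for name in client_updates[0]:
        length = len(client_updates[0][name])
        aggregated[name] = [
            sum(update[name][i] * size
                for update, size in zip(client_updates, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two hypothetical clients; the frozen backbone is never communicated.
trainable = {"prompts", "head"}
client_a = {"backbone": [9.0, 9.0], "prompts": [0.0, 4.0], "head": [1.0]}
client_b = {"backbone": [9.0, 9.0], "prompts": [4.0, 0.0], "head": [5.0]}
updates = [trainable_subset(client_a, trainable),
           trainable_subset(client_b, trainable)]
global_update = fed_avg(updates, client_sizes=[1, 3])
```

Because the backbone entry is filtered out before aggregation, the communication cost scales with the size of the prompt and head tensors rather than with the full model, which is the efficiency argument the caption makes.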