Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2

Md. Zahid Hossain, Mustofa Ahmed, Most. Sharmin Sultana Samu, Md. Rakibul Islam

TL;DR

This work tackles privacy concerns in automated chest X-ray report generation by proposing a Multimodal Federated Learning framework that combines a Vision Transformer (ViT) encoder with GPT-2 for text generation. It evaluates three aggregation strategies (FedAvg, Krum, and a novel Loss-aware Federated Averaging, L-FedAvg) in a four-client FL setting using the IU-Xray dataset, and finds that Krum Aggregation yields the strongest lexical and semantic quality across metrics such as ROUGE, BLEU, BERTScore, and RaTEScore. The study demonstrates that FL can approach or exceed centralized training performance while preserving data privacy, using a lightweight, replicable setup with Google Drive and Firebase for parameter exchange. These results underscore the feasibility of privacy-preserving, collaborative medical AI development for radiology reporting and provide practical guidance on aggregation strategies for multimodal FL tasks.

Abstract

The automated generation of radiology reports from chest X-ray images holds significant promise for enhancing diagnostic workflows while preserving patient privacy. Traditional centralized approaches often require transferring sensitive data, which raises privacy concerns. To address this, the study proposes a Multimodal Federated Learning framework for chest X-ray report generation using the IU-Xray dataset. The system uses a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, enabling decentralized training without sharing raw data. Three Federated Learning (FL) aggregation strategies were evaluated: FedAvg, Krum Aggregation, and a novel Loss-aware Federated Averaging (L-FedAvg). Among these, Krum Aggregation demonstrated superior performance across lexical and semantic evaluation metrics such as ROUGE, BLEU, BERTScore, and RaTEScore. The results show that FL can match or surpass centralized models in generating clinically relevant and semantically rich radiology reports. This lightweight and privacy-preserving framework paves the way for collaborative medical AI development without compromising data confidentiality.
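To make the difference between two of the aggregation strategies named above concrete, the following is a minimal sketch of standard FedAvg and Krum on flattened parameter vectors. It is not the authors' implementation: the function names, the `num_byzantine` parameter, and the toy client vectors are assumptions made for illustration, and L-FedAvg is left out because its definition does not appear in this section.

```python
from typing import List, Sequence

Vector = List[float]


def fedavg(updates: Sequence[Vector], num_samples: Sequence[int]) -> Vector:
    """FedAvg: mean of client vectors, weighted by local sample counts."""
    total = sum(num_samples)
    dim = len(updates[0])
    return [
        sum(w[i] * n for w, n in zip(updates, num_samples)) / total
        for i in range(dim)
    ]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def krum(updates: Sequence[Vector], num_byzantine: int) -> Vector:
    """Krum: return the one client vector closest to its nearest neighbours.

    Each client is scored by the sum of squared distances to its
    n - f - 2 nearest other clients (n clients, f = num_byzantine).
    The robustness guarantee of the original method assumes n > 2f + 2;
    this sketch only checks that at least one neighbour is scored.
    """
    n = len(updates)
    closest = n - num_byzantine - 2
    if closest < 1:
        raise ValueError("Krum needs at least num_byzantine + 3 clients")
    scores = []
    for i, w_i in enumerate(updates):
        dists = sorted(
            squared_distance(w_i, w_j)
            for j, w_j in enumerate(updates)
            if j != i
        )
        scores.append(sum(dists[:closest]))
    best = min(range(n), key=scores.__getitem__)
    return list(updates[best])


# Hypothetical four-client round: three similar updates and one outlier.
clients = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [10.0, -10.0]]
averaged = fedavg(clients, num_samples=[100, 100, 100, 100])
selected = krum(clients, num_byzantine=0)
```

In this toy round the outlier pulls the FedAvg result away from the three similar clients, while Krum returns one of those three unchanged. That selection behaviour is one plausible reason Krum could do well with few clients, though the section itself does not state why it performed best.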

Paper Structure

This paper contains 14 sections, 9 figures, 7 tables, 3 algorithms.

Figures (9)

  • Figure 1: Working Approach of our Federated Learning.
  • Figure 2: Sample X-ray images and the corresponding findings, in the form of a report, from the IU-Xray dataset. This report is treated as the ground truth.
  • Figure 3: Distribution of report length in number of words.
  • Figure 4: Report length distribution in the train, test, and validation splits.
  • Figure 5: Training loss for Clients 1–4 in L-FedAvg.
  • ...and 4 more figures