Med-MMFL: A Multimodal Federated Learning Benchmark in Healthcare

Aavash Chhetri, Bibek Niroula, Pratik Shrestha, Yash Raj Shrestha, Lesley A Anderson, Prashnna K Gyawali, Loris Bazzani, Binod Bhattarai

TL;DR

Med-MMFL addresses the lack of standardized evaluation for multimodal federated learning in healthcare by introducing a comprehensive benchmark that spans 5 medical multimodal datasets, 6 FL algorithms, 3 partitioning schemes, and 4 task types. It extends and adapts existing FL methods to multimodal settings (e.g., m-MOON, CreamMFL) and provides reproducible data processing and partitioning pipelines. No single algorithm dominates across all datasets: FedProx, FedAvg, and SCAFFOLD perform robustly, FedNova excels on BraTS-GLI2024, and CreamMFL shows resilience in certain non-IID conditions. By releasing a unified evaluation framework and datasets, Med-MMFL enables reproducible, fair comparisons and accelerates the development of clinically relevant multimodal FL solutions in healthcare.

Abstract

Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal settings and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding in medical multimodal FL (MMFL). To this end, we introduce Med-MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state-of-the-art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non-IID settings to simulate real-world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and visual question answering (VQA) tasks. To support reproducibility and fair comparison of future MMFL methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at https://github.com/bhattarailab/Med-MMFL-Benchmark.

Paper Structure (38 sections, 3 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Overview of our proposed Med-MMFL benchmark framework. It spans diverse multimodal medical datasets, task types, and client partitioning strategies, integrating multiple FL algorithms to provide a unified evaluation platform.
  • Figure 2: Representative client-level label distribution for the MIMIC-CXR-JPG dataset obtained using our federated partitioning strategy. IID splits produce similar label frequencies across clients, whereas non-IID splits yield heterogeneous distributions where certain labels dominate or are absent on specific clients. Other datasets exhibit analogous distribution patterns under the same protocol (see the supplementary datasets section).
  • Figure 3: Number of settings in which each algorithm outperforms the others across our Med-MMFL benchmark. The stacked bars are color-coded by data partition type.
  • Figure 4: The distribution of labels in MIMIC-CXR-JPG (Johnson et al., 2019).
  • Figure 5: t-SNE visualization of class-proportion vectors across clients for the Fed-BraTS-GLI2024 dataset under three partitioning strategies: synthetic IID, synthetic non-IID ($\alpha=0.8$), and synthetic non-IID ($\alpha=0.2$), each with 3 clients. Class-proportion vectors are derived from normalized voxel counts per class within each segmentation mask. As the Dirichlet concentration parameter $\alpha$ decreases, client distributions become increasingly separated, illustrating the controlled progression from balanced to highly skewed label distributions.
  • ...and 4 more figures
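
The Dirichlet-controlled non-IID partitioning referenced in the Figure 2 and Figure 5 captions can be illustrated with a minimal sketch. This is not the benchmark's released pipeline; it is a commonly used label-skew scheme written under stated assumptions: each sample carries a single class label, each class is split across clients according to a symmetric Dirichlet draw with concentration `alpha`, and the function names `dirichlet_sample`, `partition_by_label`, and `class_proportions` are hypothetical. For segmentation data such as Fed-BraTS-GLI2024, the caption describes class-proportion vectors built from normalized voxel counts, which this per-sample sketch does not reproduce.

```python
import random
from collections import defaultdict


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k entries
    by normalizing k independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def partition_by_label(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients (hypothetical helper).

    For every class, a Dirichlet draw decides what share of that class
    each client receives. Smaller alpha gives more skewed clients.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = dirichlet_sample(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += shares[c]
            if c == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            end = max(end, start)
            clients[c].extend(indices[start:end])
            start = end
    return clients


def class_proportions(labels, client_indices, classes):
    """Normalized per-class frequencies for one client."""
    counts = [sum(1 for i in client_indices if labels[i] == c) for c in classes]
    total = sum(counts) or 1  # an empty client yields an all-zero vector
    return [n / total for n in counts]


if __name__ == "__main__":
    toy_labels = [i % 4 for i in range(400)]  # 4 balanced toy classes
    for alpha in (100.0, 0.8, 0.2):
        parts = partition_by_label(toy_labels, num_clients=3, alpha=alpha, seed=1)
        print(f"alpha={alpha}")
        for cid, part in enumerate(parts):
            props = class_proportions(toy_labels, part, classes=[0, 1, 2, 3])
            print(f"  client {cid}: n={len(part)}", [round(p, 2) for p in props])
```

Running the script prints per-client class proportions for three values of `alpha`; with a large `alpha` the clients tend toward similar proportions, and with a small `alpha` some classes tend to concentrate on a single client, which is the qualitative behavior the captions describe.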