FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging

Kumail Alhamoud; Yasir Ghunaim; Motasem Alfarra; Thomas Hartvigsen; Philip Torr; Bernard Ghanem; Adel Bibi; Marzyeh Ghassemi

TL;DR

FedMedICL addresses generalization under simultaneous distribution shifts in federated medical imaging by unifying label, demographic, and temporal shifts into a single benchmark. It provides a problem formulation and an extensible testbed with six datasets and a pandemic-spread scenario to evaluate continual learning under federated settings. Empirical results show that a simple batch-balancing approach often outperforms more complex federated baselines, highlighting limitations of prior benchmarks that treat shifts in isolation. The work proposes a flexible, reproducible framework that can guide future development of robust medical imaging models in realistic, data-siloed clinical environments.

Abstract

For medical imaging AI models to be clinically impactful, they must generalize. However, this goal is hindered by (i) diverse types of distribution shifts, such as temporal, demographic, and label shifts, and (ii) limited diversity in datasets that are siloed within single medical institutions. While these limitations have spurred interest in federated learning, current evaluation benchmarks fail to evaluate different shifts simultaneously. In real healthcare settings, multiple types of shifts co-exist, yet their impact on medical imaging performance remains unstudied. In response, we introduce FedMedICL, a unified framework and benchmark to holistically evaluate federated medical imaging challenges, simultaneously capturing label, demographic, and temporal distribution shifts. We comprehensively evaluate several popular methods on six diverse medical imaging datasets (totaling 550 GPU hours). Furthermore, we use FedMedICL to simulate COVID-19 propagation across hospitals and evaluate whether methods can adapt to pandemic changes in disease prevalence. We find that a simple batch balancing technique surpasses advanced methods in average performance across FedMedICL experiments. This finding questions the applicability of results from previous, narrow benchmarks in real-world medical settings.
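
The abstract does not spell out how the batch balancing technique works. As a minimal sketch, assuming it means drawing each training batch so that every class in a client's local data is represented about equally, one common form looks like the code below. The helper name `balanced_batch`, its parameters, and the toy 95:5 label split are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_batch(labels, batch_size, rng=None):
    """Return sample indices for one class-balanced batch.

    Hypothetical helper: picks a class uniformly at random, then picks
    one sample of that class (with replacement), so rare classes show
    up about as often as common ones.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    batch = []
    for _ in range(batch_size):
        cls = rng.choice(classes)
        batch.append(rng.choice(by_class[cls]))
    return batch


# Toy client whose local labels are imbalanced 95:5.
labels = [0] * 95 + [1] * 5
batch = balanced_batch(labels, batch_size=1000, rng=random.Random(0))
# Fraction of the batch that belongs to the rare class.
share_rare = sum(labels[i] == 1 for i in batch) / len(batch)
```

With uniform class sampling, the rare class fills roughly half of each batch even though it makes up only 5% of the local data. Sampling with replacement means rare samples repeat often, which is the usual trade-off of this kind of balancing.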

Paper Structure (11 sections, 6 figures, 2 tables)

Introduction
Related Work
FedMedICL: Problem and Benchmark
Background on Federated and Continual Learning
Benchmark Construction Methodology
Evaluation and Datasets
Benchmark Results
Spread of the COVID-19 Pandemic Between Hospitals
Conclusions and Future Research
Acknowledgments
Disclosure of Interests

Figures (6)

Figure 1: (a) Problem Setup. We model a federated medical imaging scenario, in which siloed hospitals experience demographic imbalances and temporal shifts. (b) FedMedICL Benchmark Construction. We construct client datasets ($\mathcal{D}^1$ to $\mathcal{D}^K$), each representing a hospital with unique demographic characteristics and temporal training tasks ($\mathcal{D}^i_1$ to $\mathcal{D}^i_T$). We evaluate models on temporally aligned test tasks for testing adaptability to local demographic shifts, and on a hold-out set ($\mathcal{D}_h$) to evaluate generalization to diverse demographics.
Figure 2: Benchmarking Demographic Shifts. We simulate age-based demographic changes over time and benchmark methods in different datasets. The mean LTR accuracy across clients is reported for each method. Except on the PAPILA dataset, no method reliably competes with the simple F-CB baseline.
Figure 3: Performance on New Demographic Distributions. We report the LTR accuracy on a hold-out test set, as shown in Figure 1. Results are averaged across clients. No method consistently generalizes better than F-CB.
Figure 4: Adaptation under Pandemic Conditions. We simulate two types of hospitals experiencing COVID-19 emergence over four tasks. We report performance on non-COVID labels, in addition to examining how various methods perform in recognizing the novel COVID-19 disease across the four time steps.
Figure 5: Comparing pixel contrast density across different datasets, we observe a significant difference in image characteristics across age groups in PAPILA.
...and 1 more figure
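
The Figure 1 caption describes a layout of $K$ client datasets, each split into $T$ temporal tasks, plus a hold-out set $\mathcal{D}_h$. The sketch below shows one way such a layout could be produced. Everything about the splitting scheme is an assumption made for illustration: the function `build_splits`, the `holdout_frac` parameter, the round-robin client assignment, and the age-sorted task split are not the paper's procedure. The sketch only imitates an age-based shift over time and does not model demographic differences between clients.

```python
import random


def build_splits(samples, num_clients, num_tasks, holdout_frac=0.1, seed=0):
    """Split (sample_id, age) pairs into a hold-out set and per-client tasks.

    Assumed scheme, for illustration only: shuffle, reserve a hold-out
    fraction, deal the rest round-robin to clients, then order each
    client's data by age and cut it into contiguous temporal tasks.
    """
    rng = random.Random(seed)
    pool = list(samples)
    rng.shuffle(pool)
    n_hold = int(len(pool) * holdout_frac)
    holdout, rest = pool[:n_hold], pool[n_hold:]
    clients = []
    for k in range(num_clients):
        # Every num_clients-th sample goes to client k, sorted by age.
        local = sorted(rest[k::num_clients], key=lambda s: s[1])
        size = -(-len(local) // num_tasks)  # ceiling division
        tasks = [local[t * size:(t + 1) * size] for t in range(num_tasks)]
        clients.append(tasks)
    return clients, holdout


# Synthetic records: (sample_id, age), ages cycling through 20..79.
samples = [(i, 20 + i % 60) for i in range(1000)]
clients, holdout = build_splits(samples, num_clients=4, num_tasks=5)
```

Here `clients[i][t]` plays the role of $\mathcal{D}^{i+1}_{t+1}$ and `holdout` plays the role of $\mathcal{D}_h$. Because each client's data is sorted by age before being cut into tasks, later tasks contain older patients.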
