Fair Evaluation of Federated Learning Algorithms for Automated Breast Density Classification: The Results of the 2022 ACR-NCI-NVIDIA Federated Learning Challenge

Kendall Schmidt; Benjamin Bearce; Ken Chang; Laura Coombs; Keyvan Farahani; Marawan Elbatele; Kaouther Mouhebe; Robert Marti; Ruipeng Zhang; Yao Zhang; Yanfeng Wang; Yaojun Hu; Haochao Ying; Yuyang Xu; Conrad Testagrose; Mutlu Demirer; Vikash Gupta; Ünal Akünal; Markus Bujotzek; Klaus H. Maier-Hein; Yi Qin; Xiaomeng Li; Jayashree Kalpathy-Cramer; Holger R. Roth

TL;DR

This paper evaluates federated learning (FL) approaches for automated breast density classification across multiple simulated medical facilities, addressing generalization under non-IID data and platform constraints. The ACR-NCI-NVIDIA challenge enabled unrestricted FL experimentation using private Docker submissions and MedICI/NVIDIA FLARE infrastructure, revealing that a top FL method can rival centrally trained models on challenge data but still experiences drops on external validation and demographic subgroups. Key findings include the benefits and limits of various aggregation strategies (FedAvg, FedProx, SCAFFOLD), the potential value of pretraining and model ensembling, and the need for fairness-aware personalization, because performance varies by race and site with limited data. The study demonstrates a practical, data-access-free framework for fair FL evaluation in medical imaging and discusses implications for deploying robust, unbiased AI in diverse clinical settings.
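For readers unfamiliar with the aggregation strategies named above, the following is a minimal sketch of the FedAvg idea: the server averages client parameters, weighting each client by its number of training samples. It is an illustration only, not the implementation used by any challenge participant or by NVIDIA FLARE; the function name `fedavg`, the dictionary-of-lists parameter layout, and the sample counts are assumptions made for this example.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameters (FedAvg-style).

    client_weights: list of dicts mapping a parameter name to a flat list
                    of floats (hypothetical layout chosen for this sketch).
    client_sizes:   number of training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated = {}
    for name, reference in client_weights[0].items():
        # Weighted mean of each scalar entry across all clients.
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(reference))
        ]
    return aggregated


# Three simulated sites with made-up parameters and sample counts.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]
sizes = [1, 1, 2]
global_model = fedavg(clients, sizes)
```

In this toy run the third site holds half of the samples, so its parameters contribute half of the average. FedProx and SCAFFOLD change the local training step or add correction terms and are not covered by this sketch.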

Abstract

The correct interpretation of breast density is important in the assessment of breast cancer risk. AI has been shown capable of accurately predicting breast density; however, because imaging characteristics differ across mammography systems, models built using data from one system do not generalize well to other systems. Though federated learning (FL) has emerged as a way to improve the generalizability of AI without the need to share data, the best way to preserve features from all training data during FL is an active area of research. To explore FL methodology, the breast density classification FL challenge was hosted in partnership with the American College of Radiology, Harvard Medical School's Mass General Brigham, University of Colorado, NVIDIA, and the National Institutes of Health National Cancer Institute. Challenge participants were able to submit Docker containers capable of implementing FL on three simulated medical facilities, each containing a unique large mammography dataset. The breast density FL challenge ran from June 15 to September 5, 2022, attracting seven finalists from around the world. The winning FL submission reached a linear kappa score of 0.653 on the challenge test data and 0.413 on an external testing dataset, scoring comparably to a model trained on the same data in a central location.
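The linear kappa score reported above is a chance-corrected agreement measure that penalizes a prediction in proportion to how many categories it is away from the reference label. The sketch below shows one standard way to compute linearly weighted Cohen's kappa for four ordinal classes. It is not the challenge's scoring code; the function name and the 0-indexed labels (BI-RADS categories 1 to 4 mapped to 0 to 3) are assumptions for illustration.

```python
def linear_weighted_kappa(y_true, y_pred, num_classes=4):
    """Linearly weighted Cohen's kappa for ordinal labels 0..num_classes-1."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    k = num_classes

    # Confusion matrix: observed[i][j] counts reference i predicted as j.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagreement = 0.0  # weighted observed disagreement
    chance = 0.0        # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # linear penalty, 0 on the diagonal
            disagreement += w * observed[i][j]
            chance += w * row[i] * col[j] / n

    if chance == 0.0:
        # All labels and predictions fall in one class; returning 1.0 here
        # is a convention chosen for this sketch (other tools return NaN).
        return 1.0
    return 1.0 - disagreement / chance


perfect = linear_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3])
reversed_score = linear_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0])
```

A score of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Because the penalty grows with category distance, confusing "scattered" with "heterogeneously dense" costs less than confusing "fatty" with "extremely dense".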

Paper Structure (27 sections, 1 equation, 9 figures, 5 tables)

Introduction
Methods
Challenge Overview
Challenge Architecture
MedICI
FL environment
Challenge Data
Ranking
Post-Challenge Analysis
External Validation
Demographic Bias Analysis
Results
Phase II submissions
1st Rank (Algo. #1)
2nd Rank (Algo. #2)
...and 12 more sections

Figures (9)

Figure 1: Mammography examples representing each of the four BI-RADS breast density categories.
Figure 2: Breast density federated learning challenge architecture diagram. Inside the white ACR Azure square are five smaller white VM squares: from left to right, the MedICI server, the FL server, and three FL clients. User containers are shown in red. Other resources include Azure storage (bottom, yellow), the Azure Docker image container registry (top, yellow), data indicated by purple “SH data” squares, and uploaded algorithms as .zip files in green. A firewall controlled access. After approval by the challenge admin, users could submit their FL algorithms through MedICI, which coordinated the kickoff of FL runs across the ACR FL server and the three ACR clients.
Figure 3: BI-RADS scores across sites. 1: Almost entirely fatty, 2: Scattered areas of fibroglandular tissue, 3: Heterogeneously dense, 4: Extremely dense.
Figure 4: Site-based metrics in Phase II (Test phase).
Figure 5: Ranking stability using per-image distance metrics in Phase II (Test phase).
...and 4 more figures
