Fair Evaluation of Federated Learning Algorithms for Automated Breast Density Classification: The Results of the 2022 ACR-NCI-NVIDIA Federated Learning Challenge
Kendall Schmidt, Benjamin Bearce, Ken Chang, Laura Coombs, Keyvan Farahani, Marawan Elbatele, Kaouther Mouhebe, Robert Marti, Ruipeng Zhang, Yao Zhang, Yanfeng Wang, Yaojun Hu, Haochao Ying, Yuyang Xu, Conrad Testagrose, Mutlu Demirer, Vikash Gupta, Ünal Akünal, Markus Bujotzek, Klaus H. Maier-Hein, Yi Qin, Xiaomeng Li, Jayashree Kalpathy-Cramer, Holger R. Roth
TL;DR
This paper evaluates federated learning (FL) approaches for automated breast density classification across multiple simulated medical facilities, addressing generalization under non-IID data and platform constraints. The ACR-NCI-NVIDIA challenge enabled unrestricted FL experimentation using private Docker submissions and MedICI/NVIDIA FLARE infrastructure, revealing that a top FL method can rival centrally trained models on challenge data but still loses performance on external validation data and on demographic subgroups. Key findings include the benefits and limits of various aggregation strategies (FedAvg, FedProx, SCAFFOLD), the potential value of pretraining and model ensembling, and the need for fairness-aware personalization, because performance varies by race and by site when data are limited. The study demonstrates a practical framework for fair FL evaluation in medical imaging that requires no direct data access, and it discusses implications for deploying robust, unbiased AI in diverse clinical settings.
Abstract
The correct interpretation of breast density is important in the assessment of breast cancer risk. AI has been shown capable of accurately predicting breast density; however, because imaging characteristics differ across mammography systems, models built using data from one system do not generalize well to other systems. Though federated learning (FL) has emerged as a way to improve the generalizability of AI without the need to share data, the best way to preserve features from all training data during FL is an active area of research. To explore FL methodology, the breast density classification FL challenge was hosted in partnership with the American College of Radiology, Harvard Medical School's Mass General Brigham, University of Colorado, NVIDIA, and the National Institutes of Health National Cancer Institute. Challenge participants were able to submit Docker containers capable of implementing FL on three simulated medical facilities, each containing a unique large mammography dataset. The breast density FL challenge ran from June 15 to September 5, 2022, attracting seven finalists from around the world. The winning FL submission reached a linear kappa score of 0.653 on the challenge test data and 0.413 on an external testing dataset, scoring comparably to a model trained on the same data in a central location.
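The linear kappa score reported above is an agreement measure for ordinal labels that penalizes a disagreement in proportion to the distance between the two classes. The following is a minimal sketch of the standard linearly weighted Cohen's kappa, not the challenge's scoring code. It assumes four density classes encoded as integers 0 to 3; the function name `linear_weighted_kappa` and the `num_classes` parameter are illustrative.

```python
from collections import Counter


def linear_weighted_kappa(y_true, y_pred, num_classes=4):
    """Linearly weighted Cohen's kappa for integer labels in [0, num_classes).

    Assumption: four ordinal density classes encoded 0..3 by default.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))  # joint counts of (true, predicted)
    true_counts = Counter(y_true)            # marginal counts of true labels
    pred_counts = Counter(y_pred)            # marginal counts of predictions

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Linear penalty: 0 on the diagonal, 1 for the most distant pair.
            weight = abs(i - j) / (num_classes - 1)
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * true_counts[i] * pred_counts[j] / (n * n)

    if expected_disagreement == 0.0:
        # Degenerate case: both raters use a single identical class.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    reference = [0, 1, 2, 3]
    predicted = [1, 1, 2, 3]
    print(linear_weighted_kappa(reference, predicted))
```

A score of 1 indicates perfect agreement, 0 indicates agreement no better than chance given the label marginals, and negative values indicate systematic disagreement. Because the penalty grows with class distance, confusing adjacent density categories costs less than confusing the two extremes.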
