FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models
Ruinan Jin, Zikang Xu, Yuan Zhong, Qiongsong Yao, Qi Dou, S. Kevin Zhou, Xiaoxiao Li
TL;DR
FairMedFM introduces a comprehensive fairness benchmarking framework for medical imaging foundation models, integrating 17 datasets and 20 FMs to evaluate fairness across classification and segmentation tasks using a unified set of 9 metrics and 6 mitigation methods. The study reveals pervasive bias, dataset-specific disparities, and limited effectiveness of existing debiasing approaches in FM settings, and demonstrates that fairness-utility tradeoffs depend on both model choice and usage. By providing an extensible, open-source codebase, FairMedFM aims to standardize evaluation, promote reproducible research, and guide the development of fairer medical imaging AI, with plans to broaden tasks, modalities, and fairness definitions over time.
Abstract
The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about fairness, especially when applied to diverse and underrepresented populations in healthcare applications. There is currently a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries for evaluating and understanding the fairness of FMs in medical imaging, which makes it considerably harder to formulate and implement solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs under various usages, such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting, in two downstream tasks: classification and segmentation. Our exhaustive analysis evaluates fairness with different evaluation metrics from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs across FMs, consistent disparities on the same datasets regardless of the FM used, and limited effectiveness of existing unfairness mitigation methods. Check out FairMedFM's project page and open-source codebase, which supports extensible functionalities and applications and is intended to support inclusive, long-term studies of FMs in medical imaging.
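To make the idea of a group-wise fairness evaluation concrete, the following is a minimal sketch of one common style of disparity measure: per-group accuracy and the gap between the best- and worst-performing groups. It is an illustrative assumption, not necessarily one of the metrics used in FairMedFM, and it does not use the FairMedFM codebase; the function name `group_accuracy_gap` and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_accuracy_gap(y_true, y_pred, groups):
    """Return per-group accuracy and the gap between the best and worst group.

    Hypothetical helper for illustration only; not part of FairMedFM.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    per_group = {g: correct[g] / total[g] for g in total}
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap


# Toy, made-up predictions for two subgroups of a sensitive attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group, gap = group_accuracy_gap(y_true, y_pred, groups)
print(per_group, gap)
```

A larger gap indicates that the model's utility differs more between subgroups. In practice, a measure like this would be reported alongside overall utility, since a model can shrink the gap simply by performing worse on every group.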
