FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

Ruinan Jin, Zikang Xu, Yuan Zhong, Qiongsong Yao, Qi Dou, S. Kevin Zhou, Xiaoxiao Li

TL;DR

FairMedFM introduces a comprehensive fairness benchmarking framework for medical imaging foundation models, integrating 17 datasets and 20 FMs to evaluate fairness across classification and segmentation tasks using a unified set of 9 metrics and 6 mitigation methods. The study reveals pervasive bias, dataset-specific disparities, and limited effectiveness of existing debiasing approaches in FM settings, and demonstrates that fairness-utility tradeoffs depend on both model choice and usage. By providing an extensible, open-source codebase, FairMedFM aims to standardize evaluation, promote reproducible research, and guide the development of fairer medical imaging AI, with plans to broaden tasks, modalities, and fairness definitions over time.

Abstract

The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about their fairness, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries to evaluate and understand the fairness performance of FMs in medical imaging, leading to considerable challenges in formulating and implementing solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates with 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs under various usages, such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting, in two downstream tasks: classification and segmentation. Our exhaustive analysis evaluates fairness performance over different evaluation metrics from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs across FMs, consistent disparities on the same datasets regardless of the FM, and limited effectiveness of existing unfairness mitigation methods. Check out FairMedFM's project page and open-source codebase, which supports extensible functionalities and applications and is inclusive of studies on FMs in medical imaging over the long term.

Paper Structure (41 sections, 11 equations, 19 figures, 18 tables)

Figures (19)

  • Figure 1: Overview of the FairMedFM framework, a standardized pipeline to investigate fairness on diverse datasets (2D, 2.5D, and 3D), with comprehensive functionalities (various FMs, tasks, usages, and debiasing algorithms) and thorough evaluation metrics. The details are explained in Sec. \ref{sec:FairMedFM}.
  • Figure 2: Bias in classification tasks. $\text{AUC}_\Delta$ is the fairness evaluation metric. "$\dagger$" denotes vision-language models, and "$*$" denotes pure vision models, where CLIP-ZS and CLIP-Adapt are not applicable.
  • Figure 3: Bias in segmentation tasks. $\text{DSC}_\Delta$ is the fairness evaluation metric. Note that SAM-Med3D, FastSAM-3D, and SegVol are only applicable for 3D datasets.
  • Figure 4: Fairness-utility tradeoff in FMs for different components on (a-b) classification and (c-d) segmentation tasks. We use $\text{AUC}_{\text{ES}}$ and $\text{DSC}_{\text{ES}}$ as evaluation metrics. (a) Fairness-utility tradeoff for different classification usages (using CLIP as an example); (b) Fairness-utility tradeoff for general-purpose and medical-specific FMs in classification; (c) Fairness-utility tradeoff for different prompts in segmentation (using SAM-Med2D as an example); (d) Degree of fairness for different SegFM categories in segmentation.
  • Figure 5: Consistent disparities across sensitive attributes (SA) occur for various FMs on the same dataset. (a) LP; (b) CLIP-Adapt. Points on the black dashed line represent equal utility.
  • ...and 14 more figures
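
The captions above name $\text{AUC}_\Delta$ as the classification fairness metric without defining it on this page. As a minimal sketch, assuming $\text{AUC}_\Delta$ means the largest gap between per-subgroup AUCs (an assumption, not confirmed by the text here), it could be computed as follows. The function names `auc` and `auc_gap` and the toy data are hypothetical and are not taken from the FairMedFM codebase.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as 0.5 (pairwise Mann-Whitney form)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gap(labels, scores, groups):
    """Assumed definition of the AUC gap: max minus min per-subgroup AUC.
    Returns the gap together with the per-subgroup AUCs."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    per_group = {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}
    return max(per_group.values()) - min(per_group.values()), per_group


# Toy data: binary labels, model scores, and one sensitive attribute (A or B).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.9, 0.3]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap, per_group = auc_gap(labels, scores, groups)
print(per_group, gap)
```

In this toy example subgroup A is ranked perfectly while subgroup B has one misordered positive/negative pair, so the gap is nonzero even though the pooled AUC looks high. The pairwise AUC is quadratic in sample count and is only meant for illustration.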