Fairness in Multi-modal Medical Diagnosis with Demonstration Selection

Dawei Li, Zijian Gu, Peng Wang, Chuhan Song, Zhen Tan, Mohan Zhang, Tianlong Chen, Yu Tian, Song Wang

TL;DR

The paper addresses fairness gaps in medical image reasoning with multimodal language models and shows that standard in-context demonstration strategies propagate demographic bias. It introduces Fairness-Aware Demonstration Selection (FADS), a tuning-free framework that constructs demographically balanced and semantically relevant demonstrations via clustering-based data bias mitigation and balanced sampling. Through extensive experiments on FairCLIP Glaucoma and CheXpert Plus datasets with models like Qwen and LLaVA-Med, FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining competitive accuracy, demonstrating a scalable path toward equitable medical image reasoning. Overall, the work highlights fairness-aware in-context learning as a practical, data-efficient approach for trustworthy and scalable medical AI without requiring model retraining.

Abstract

Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
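
To make the idea of "demographically balanced and semantically relevant" demonstrations concrete, the following is a minimal sketch and not the authors' FADS implementation. It omits the clustering step entirely and simply takes the most similar candidates within each demographic group under an equal per-group quota. The function names (`cosine`, `balanced_select`), the toy embeddings, and the pool layout are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def balanced_select(query, pool, per_group):
    """Pick the `per_group` most similar examples from each demographic group.

    pool: list of (example_id, embedding, group) tuples (hypothetical layout).
    """
    by_group = defaultdict(list)
    for ex_id, emb, group in pool:
        by_group[group].append((cosine(query, emb), ex_id))

    selected = []
    for group in sorted(by_group):  # fixed group order for reproducibility
        ranked = sorted(by_group[group], reverse=True)  # most similar first
        selected.extend(ex_id for _, ex_id in ranked[:per_group])
    return selected


# Toy pool: three Male and two Female candidates with 2-D embeddings.
pool = [
    ("a", [1.0, 0.0], "Male"),
    ("b", [0.9, 0.1], "Male"),
    ("c", [0.8, 0.2], "Male"),
    ("d", [0.0, 1.0], "Female"),
    ("e", [0.5, 0.5], "Female"),
]
demos = balanced_select([1.0, 0.0], pool, per_group=1)
print(demos)
```

In this toy pool, a plain top-2 similarity ranking would return two Male examples ("a" and "b"), whereas the quota-based selection returns one example per group.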

Paper Structure

This paper contains 23 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of challenges and findings. We investigate whether ICL can improve fairness in medical image reasoning. Different demonstration selection (DS) strategies—Random, Similarity, and K-Means—are analyzed across attributes (Gender, Race, Ethnicity). Empirical results reveal that data imbalance in selected demonstrations is the primary source of fairness degradation.
  • Figure 2: Comparison of performance and fairness metrics for Random, Similarity, and K-Means demonstration selection methods on Qwen (bottom row) and LLaVA-Med (top row). Conventional DS strategies exhibit inconsistent and sometimes conflicting fairness behaviors.
  • Figure 3: Correlation between data imbalance and fairness disparity. Each point corresponds to one attribute under a DS method (Random, Similarity, or K-Means). Larger demographic imbalance (MaxDiff) in the selected demonstrations correlates with higher Average Disparity (AD), indicating that biased exemplar composition directly drives fairness degradation.
  • Figure 4: Case study: Similarity-based ICL vs FADS. Similarity overrepresents majority groups (75% Male, 50% White) and excludes the Black subgroup; FADS enforces balanced sampling (50/50 gender, 33.3% per race).
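
The imbalance figures quoted in the captions above can be reproduced with a short sketch. It assumes MaxDiff is the gap between the largest and smallest group proportions among the selected demonstrations; the paper's exact definition is not given here, so treat this as an illustration. The demonstration lists are hypothetical and only mirror the percentages in the Figure 4 caption (the race example assumes three groups, with the non-White half of the Similarity selection labelled Asian).

```python
from collections import Counter


def max_diff(labels, groups):
    """Largest minus smallest group proportion (assumed MaxDiff definition)."""
    counts = Counter(labels)
    total = len(labels)
    proportions = [counts.get(g, 0) / total for g in groups]
    return max(proportions) - min(proportions)


genders = ["Male", "Female"]
races = ["White", "Asian", "Black"]  # assumed three race groups

# Similarity-based selection: 75% Male, 50% White, no Black examples.
sim_gender = ["Male", "Male", "Male", "Female"]
sim_race = ["White", "White", "Asian", "Asian"]

# Balanced selection: 50/50 gender, one third per race.
bal_gender = ["Male", "Female", "Male", "Female"]
bal_race = ["White", "Asian", "Black", "White", "Asian", "Black"]

print(max_diff(sim_gender, genders), max_diff(sim_race, races))
print(max_diff(bal_gender, genders), max_diff(bal_race, races))
```

Under this assumed definition, the Similarity selection has an imbalance of 0.5 on both attributes, while the balanced selection has an imbalance of 0 on both.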