FairMedQA: Benchmarking Bias in Large Language Models for Medical Question Answering

Ying Xiao, Jie Huang, Ruijuan He, Jing Xiao, Mohammad Reza Mousavi, Yepang Liu, Kezhi Li, Zhenpeng Chen, Jie M. Zhang

TL;DR

FairMedQA introduces an automated, scalable bias benchmark for medical QA by generating adversarial counterfactual vignettes that modify sensitive attributes while preserving clinical content. The framework couples automated vignette neutralization with a three-agent generation pipeline and rigorous human QC, enabling large-scale bias assessment across 12 LLMs and six model-version pairs. Results show substantial bias disparities across attributes and demonstrate that improvements in model accuracy can coincide with reductions in bias, challenging the notion of an inherent fairness–accuracy trade-off. The work provides a reproducible benchmark and evidence supporting targeted debiasing and identity-aware validation for safe deployment of medical AI in clinical decision-support systems.

Abstract

Large language models (LLMs) are approaching expert-level performance in medical question answering (QA), demonstrating strong potential to improve public healthcare. However, underlying biases related to sensitive attributes such as sex and race pose life-critical risks. The extent to which such sensitive attributes affect diagnosis remains an open question and requires comprehensive empirical investigation. Moreover, even the latest Counterfactual Patient Variations (CPV) benchmark struggles to distinguish the bias levels of different LLMs. To address this gap, we propose a new benchmark, FairMedQA, and use it to evaluate 12 representative LLMs. FairMedQA contains 4,806 counterfactual question pairs constructed from 801 clinical vignettes. Our results reveal substantial accuracy disparities across sensitive demographic groups, ranging from 3 to 19 percentage points. Notably, FairMedQA exposes biases that are at least 12 percentage points larger than those identified by the latest CPV benchmark, indicating greater benchmarking sensitivity. Our results underscore an urgent need for targeted debiasing techniques and more rigorous, identity-aware validation protocols before LLMs can be safely integrated into practical clinical decision-support systems.

Paper Structure

This paper contains 44 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Workflow of FairMedQA dataset construction, including (1) clinical vignette preparation, (2) adversarial variant construction, and (3) manual quality control. (1) Clinical vignettes are filtered and rewritten into neutralized versions without sensitive attributes. (2) Neutralized vignettes are passed to the Generation-Agent, which produces adversarial descriptions based on sensitive attributes. These are then fused with the neutral vignettes by the Fusion-Agent to create adversarial variants. The Validation-Agent assesses whether the variants trigger bias, labeling them as "successful" or "failed"; each variant can be revised up to two times. (3) All variants, regardless of outcome, are reviewed and refined by human auditors based on quality criteria.
  • Figure 2: An example of a USMLE-style medical question.
  • Figure 3: Diagnostic accuracy of 12 LLMs on the FairMedQA dataset.
  • Figure 4: Counterfactual fair rate (CFR) and accuracy disparity (AD) of LLMs on FairMedQA. Among the studied models, GPT-5 achieves the best fairness under both metrics, with an average CFR of 94% and an average AD of 0.03 across the three sensitive attributes.
  • Figure 5: Percentage of adversarial clinical vignette variants that successfully trigger Validation-Agent bias within three trials. "Round 1" denotes the bias-triggering rate in the first trial of variant generation. Variants from both GPT-Agent and Deepseek-Agent trigger Validation-Agent bias at notable rates, ranging from 12.9% to 29.8% across the six groups.
  • ...and 4 more figures
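
The generate, fuse, and validate loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate`, `fuse`, and `triggers_bias` are hypothetical stand-ins for the Generation-Agent, Fusion-Agent, and Validation-Agent, and the `VariantResult` record is invented for the example. Only the control flow (one initial attempt plus up to two revisions, ending in a "successful" or "failed" label) comes from the caption.

```python
from dataclasses import dataclass
from typing import Callable

# Per the Figure 1 caption, each variant can be revised up to two times,
# which gives at most three generation attempts in total.
MAX_REVISIONS = 2


@dataclass
class VariantResult:
    variant: str   # last adversarial vignette produced
    label: str     # "successful" or "failed"
    attempts: int  # number of generation attempts used (1 to 3)


def build_variant(
    neutral_vignette: str,
    attribute: str,
    generate: Callable[[str, str, int], str],  # hypothetical Generation-Agent
    fuse: Callable[[str, str], str],           # hypothetical Fusion-Agent
    triggers_bias: Callable[[str], bool],      # hypothetical Validation-Agent check
) -> VariantResult:
    """Try to build an adversarial variant of a neutralized vignette."""
    variant = neutral_vignette
    for attempt in range(1, MAX_REVISIONS + 2):
        # Produce an adversarial description for the sensitive attribute.
        description = generate(neutral_vignette, attribute, attempt)
        # Merge the description into the neutral vignette.
        variant = fuse(neutral_vignette, description)
        # Stop as soon as the validator reports that bias was triggered.
        if triggers_bias(variant):
            return VariantResult(variant, "successful", attempt)
    # All attempts exhausted without triggering bias.
    return VariantResult(variant, "failed", MAX_REVISIONS + 1)
```

In the workflow described by the caption, both "successful" and "failed" variants would then go to human auditors for review, so the label records the validation outcome and does not filter anything out.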