FairMedQA: Benchmarking Bias in Large Language Models for Medical Question Answering
Ying Xiao, Jie Huang, Ruijuan He, Jing Xiao, Mohammad Reza Mousavi, Yepang Liu, Kezhi Li, Zhenpeng Chen, Jie M. Zhang
TL;DR
FairMedQA introduces an automated, scalable bias benchmark for medical QA by generating adversarial counterfactual vignettes that modify sensitive attributes while preserving clinical content. The framework couples automated vignette neutralization with a three-agent generation pipeline and rigorous human QC, enabling large-scale bias assessment across 12 LLMs and six model-version pairs. Results show substantial bias disparities across attributes and demonstrate that improvements in model accuracy can coincide with reductions in bias, challenging the notion of an inherent fairness–accuracy trade-off. The work provides a reproducible benchmark and evidence supporting targeted debiasing and identity-aware validation for safe deployment of medical AI in clinical decision-support systems.
Abstract
Large language models (LLMs) are approaching expert-level performance in medical question answering (QA), demonstrating strong potential to improve public healthcare. However, underlying biases related to sensitive attributes such as sex and race pose life-critical risks. The extent to which such sensitive attributes affect diagnosis remains an open question and requires comprehensive empirical investigation. Additionally, even the latest Counterfactual Patient Variations (CPV) benchmark struggles to distinguish the bias levels of different LLMs. To address these gaps, we propose a new benchmark, FairMedQA, and use it to evaluate 12 representative LLMs. FairMedQA contains 4,806 counterfactual question pairs constructed from 801 clinical vignettes. Our results reveal substantial accuracy disparities, ranging from 3 to 19 percentage points, across sensitive demographic groups. Notably, FairMedQA exposes biases that are at least 12 percentage points larger than those identified by the latest CPV benchmark, demonstrating superior benchmarking sensitivity. Our results underscore an urgent need for targeted debiasing techniques and more rigorous, identity-aware validation protocols before LLMs can be safely integrated into practical clinical decision-support systems.
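As an illustration of the kind of quantity the abstract reports, the sketch below computes an accuracy disparity in percentage points across demographic groups. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names, the toy records, and the definition of disparity as the gap between the best and worst per-group accuracy are all hypothetical. The figure of six counterfactual pairs per vignette is inferred only from the arithmetic 4,806 / 801 and is not stated in the abstract.

```python
from collections import defaultdict

# Sanity check on the reported counts: 801 vignettes and 4,806 pairs.
# The per-vignette figure of six is inferred from this arithmetic only.
assert 801 * 6 == 4806


def accuracy_by_group(records):
    """Return accuracy per group from (group, is_correct) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


def accuracy_disparity_pp(records):
    """Gap between best and worst group accuracy, in percentage points.

    Assumption: disparity is defined here as max minus min group accuracy;
    the benchmark may define it differently.
    """
    acc = accuracy_by_group(records)
    return 100.0 * (max(acc.values()) - min(acc.values()))


# Hypothetical toy data: each record is one answered counterfactual question.
toy_records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]

if __name__ == "__main__":
    print(accuracy_by_group(toy_records))
    print(accuracy_disparity_pp(toy_records))
```

With the toy data, group_a answers 3 of 4 questions correctly and group_b answers 2 of 4, so the disparity under this assumed definition is 25 percentage points.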
