Bias patterns in the application of LLMs for clinical decision support: A comprehensive study
Raphael Poulain, Hamed Fayyaz, Rahmatollah Beheshti
TL;DR
This work presents a comprehensive, multi-model assessment of social biases in clinical decision support LLMs. It systematically evaluates eight LLMs across three vignette-based QA tasks (Q-Pain, Nurse Bias, NEJM Healer) and a real-world MIMIC-IV readmission task, employing red-teaming via demographic rotations and three prompting strategies (zero-shot, few-shot, Chain-of-Thought). The study finds heterogeneous bias patterns across models and tasks, with some clinically tuned models exhibiting disparities by race and gender, while others show minimal effects; importantly, Chain-of-Thought prompting can mitigate bias in several cases. The results underscore that model size alone does not predict fairness and highlight the pivotal role of prompt design and dataset characteristics in fairness outcomes. The authors advocate for broader evaluations, transparent reporting, and regulatory guardrails to ensure equitable deployment of LLMs in clinical decision support.
Abstract
Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluated eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns such as larger models not being necessarily less biased and fined-tuned models on medical data not being necessarily better than the general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns and shows that specific phrasing can influence bias patterns and reflection-type approaches (like Chain of Thought) can reduce biased outcomes effectively. Consistent with prior studies, we call on additional evaluations, scrutiny, and enhancement of LLMs used in clinical decision support applications.
