Bias patterns in the application of LLMs for clinical decision support: A comprehensive study

Raphael Poulain, Hamed Fayyaz, Rahmatollah Beheshti

TL;DR

This work presents a comprehensive, multi-model assessment of social biases in clinical decision support LLMs. It systematically evaluates eight LLMs across three vignette-based QA tasks (Q-Pain, Nurse Bias, NEJM Healer) and a real-world MIMIC-IV readmission task, employing red-teaming via demographic rotations and three prompting strategies (zero-shot, few-shot, Chain-of-Thought). The study finds heterogeneous bias patterns across models and tasks, with some clinically tuned models exhibiting disparities by race and gender, while others show minimal effects; importantly, Chain-of-Thought prompting can mitigate bias in several cases. The results underscore that model size alone does not predict fairness and highlight the pivotal role of prompt design and dataset characteristics in fairness outcomes. The authors advocate for broader evaluations, transparent reporting, and regulatory guardrails to ensure equitable deployment of LLMs in clinical decision support.

Abstract

Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluated eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns, such as larger models not necessarily being less biased and models fine-tuned on medical data not necessarily being better than general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns and shows that specific phrasing can influence bias patterns and that reflection-type approaches (like Chain of Thought) can reduce biased outcomes effectively. Consistent with prior studies, we call for additional evaluations, scrutiny, and enhancement of LLMs used in clinical decision support applications.
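
The red-teaming idea described in the abstract (holding a clinical vignette fixed and varying only the patient's demographics) can be illustrated with a minimal sketch. The template text, the demographic lists, and the function name `rotate_demographics` below are hypothetical placeholders chosen for illustration; they are not taken from the paper or its datasets.

```python
from itertools import product

# Illustrative demographic values (assumption: not the paper's exact lists).
RACES = ["Asian", "Black", "Hispanic", "White"]
GENDERS = ["man", "woman"]

# Hypothetical vignette template; real vignettes come from the QA datasets.
TEMPLATE = (
    "Patient is a 45-year-old {race} {gender} reporting severe lower back pain. "
    "Would you prescribe pain medication? Answer Yes or No."
)


def rotate_demographics(template, races=RACES, genders=GENDERS):
    """Return one prompt per (race, gender) pair; the clinical content is unchanged."""
    return [
        {"race": race, "gender": gender,
         "prompt": template.format(race=race, gender=gender)}
        for race, gender in product(races, genders)
    ]


variants = rotate_demographics(TEMPLATE)
for v in variants:
    print(v["race"], v["gender"], "->", v["prompt"][:60])
```

Each variant would then be sent to the model under test, and the responses compared across groups; since the clinical details are identical, any systematic difference in outputs can only be attributed to the demographic attributes.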

Paper Structure (28 sections, 9 figures)

This paper contains 28 sections, 9 figures.

Figures (9)

  • Figure 1: Visual description of the evaluation framework.
  • Figure 2: Results on the Q-Pain dataset. The LLMs were presented with clinical vignettes describing various medical contexts and were asked whether they would prescribe pain medication to the patients. Each demographic is color-coded and the bars represent the average probability of denying the pain treatment for each task. The error bars show the standard deviation. CNC: Chronic Non Cancer, CC: Chronic Cancer, AC: Acute Cancer, ANC: Acute Non Cancer, Post Op: Postoperative
  • Figure 3: Violin plot of the results on the LLMs' perception of patients based on a Likert scale. The LLMs were presented with patient summaries and statements related to pain perception or illness severity and were asked to rate their agreement with the statement. 1: Strongly disagree with the statement; 5: Strongly agree.
  • Figure 4: Results on the NEJM Healer vignettes in a treatment recommendation scenario. The LLMs were given a clinical vignette and were asked whether they would refer the patient to a specialist and whether they would order medical imaging. Imaging Rate is hatched (Left side), Referral Rate is filled (Right side). Each gender is color-coded. The black vertical bar represents one standard deviation.
  • Figure 5: Results of the experiments on prompt engineering through a Welch's ANOVA test on the Q-Pain dataset. Higher values signify greater discrepancies between demographics, indicating stronger biases. Detailed results in Figures \ref{fig:prompt_full} and \ref{fig:prompt_full2}.
  • ...and 4 more figures
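
For readers unfamiliar with the Welch's ANOVA test mentioned in the Figure 5 caption, the following is a minimal sketch of how the statistic is computed from per-group scores. It uses the standard textbook formula and only the Python standard library; the function name `welch_anova` and the toy numbers are assumptions for illustration, not the authors' implementation or data.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Welch's one-way ANOVA for groups with possibly unequal variances.

    `groups` is a list of lists of numeric scores (one list per demographic).
    Each group needs at least two values and a non-zero sample variance.
    Returns (F statistic, numerator df, denominator df).
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Precision weights: n_i / s_i^2, using the sample variance.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_w

    # Weighted between-group variability.
    numerator = sum(w * (m - grand_mean) ** 2
                    for w, m in zip(weights, means)) / (k - 1)
    # Correction term accounting for unequal variances and sample sizes.
    lam = sum((1 - w / total_w) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Toy scores for two hypothetical demographic groups.
f_stat, df1, df2 = welch_anova([[1, 2, 3], [2, 4, 6]])
print(f_stat, df1, df2)
```

A larger F statistic means the group means differ more relative to their within-group spread, which matches the caption's reading that higher values indicate greater discrepancies between demographics. With exactly two groups, the statistic reduces to the square of Welch's t statistic.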