How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making?

Kenza Benkirane, Jackie Kay, Maria Perez-Ortiz

TL;DR

The findings reveal that addressing social biases in LLMs requires a multidimensional approach, because mitigating gender bias can introduce ethnicity biases, and that gender bias in LLM embeddings varies significantly across medical specialities.

Abstract

Recent advancements in Large Language Models (LLMs) have positioned them as powerful tools for clinical decision-making, with rapidly expanding applications in healthcare. However, bias, particularly regarding gender and ethnicity, remains a significant challenge for the clinical implementation of LLMs. This research investigates the evaluation and mitigation of gender and ethnicity biases in LLMs applied to complex clinical cases. We introduce a novel Counterfactual Patient Variations (CPV) dataset derived from the JAMA Clinical Challenge. Using this dataset, we build a framework for bias evaluation that employs both Multiple Choice Questions (MCQs) and their corresponding explanations. We explore prompting with eight LLMs and fine-tuning as debiasing methods. Our findings reveal that addressing social biases in LLMs requires a multidimensional approach, because mitigating gender bias can introduce ethnicity biases, and that gender bias in LLM embeddings varies significantly across medical specialities. We demonstrate that evaluating both the MCQ response and the explanation process is crucial, as correct responses can be based on biased reasoning. We provide a framework for evaluating LLM bias in real-world clinical cases, offer insights into the complex nature of bias in these models, and present strategies for bias mitigation.

Paper Structure

This paper contains 53 sections, 1 equation, 7 figures, 47 tables.

Figures (7)

  • Figure 1: Illustration of our experimental setup for evaluating bias in LLMs for clinical cases using Counterfactual Patient Variations (CPVs). The example shows how changing demographic attributes (gender and ethnicity) in otherwise identical clinical cases can lead to different model outputs.
  • Figure 2: Exploratory CPVs | Top 5 features and their importance with regard to MCQ performance. This figure illustrates that ethnicity features became highly influential when introduced, often surpassing gender features in importance. It demonstrates how the introduction of ethnicity shifted rather than eliminated bias patterns.
  • Figure 3: Bias mitigation with fine-tuning | BiasScore and GenderBias across social attributes for the baseline and fine-tuned models. This figure demonstrates that fine-tuning significantly altered gender bias patterns in explanations, substantially mitigating extreme biases across genders, albeit with some overcorrections.
  • Figure 4: Bias mitigation with fine-tuning | Heatmap of BiasScore and GenderBias across medical fields for baseline and fine-tuned models. This figure reveals significant variations in BiasScore across medical specialities, suggesting that gender stereotypes are not uniformly distributed in clinical contexts and that addressing gender bias may require a speciality-specific approach.
  • Figure 5: Ablation study without multiple-choice | Word clouds of unique words per ethnicity. From top to bottom: No ethnicity, White, Black, Asian, Hispanic, Arab. From left to right: Sonnet, GPT-3.5, GPT-4o, Gemini, Haiku, GPT-4 Turbo.
  • ...and 2 more figures
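
The counterfactual setup described in the Figure 1 caption (the same clinical case with only gender and ethnicity changed) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function `make_cpvs`, the `{patient}` placeholder, the gender terms, and the case stem are all hypothetical and chosen for illustration. Only the ethnicity labels are taken from the Figure 5 caption.

```python
from itertools import product

# Ethnicity labels as listed in the Figure 5 caption; None stands for "No ethnicity".
ETHNICITIES = [None, "White", "Black", "Asian", "Hispanic", "Arab"]
# Assumed gender terms for illustration; the paper's exact set is not stated here.
GENDERS = ["man", "woman"]


def make_cpvs(template: str) -> list[dict]:
    """Return one variation per (gender, ethnicity) pair for a case template.

    The template must contain a single "{patient}" placeholder (an assumption
    of this sketch, not a convention from the paper).
    """
    variations = []
    for gender, ethnicity in product(GENDERS, ETHNICITIES):
        # Omit the ethnicity word entirely for the "No ethnicity" condition.
        patient = f"{ethnicity} {gender}" if ethnicity else gender
        variations.append({
            "gender": gender,
            "ethnicity": ethnicity,
            "text": template.format(patient=patient),
        })
    return variations


# Hypothetical case stem, not taken from the JAMA Clinical Challenge.
template = "A 54-year-old {patient} presents with chest pain and dyspnea."
cpvs = make_cpvs(template)
for variation in cpvs[:3]:
    print(variation["text"])
```

Because every variation shares the same clinical content, any difference in a model's answer or explanation across the resulting texts can be attributed to the demographic attributes alone, which is the premise of the counterfactual comparison.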