DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models

Rajat Rawat, Hudson McBride, Dhiyaan Nirmal, Rajarshi Ghosh, Jong Moon, Dhruv Alamuri, Sean O'Brien, Kevin Zhu

TL;DR

DiversityMedQA introduces a demographic-bias benchmark for medical diagnosis by perturbing MedQA questions along gender and ethnicity axes. The study evaluates multiple LLMs, employs few-shot chain-of-thought prompting and a perturbation-filtering step to ensure perturbations are clinically appropriate, and analyzes bias via first-index and Maj@5 accuracies with Z-tests and IoU metrics. Results show substantial performance gains for the GPT-4 family and limited gender bias, while ethnicity-related biases appear more model-dependent and pronounced in older models like Llama3-8B. The work provides a practical framework and dataset to assess and mitigate demographic bias in medical QA, with implications for safer and more equitable AI-assisted healthcare.

Abstract

As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
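
The TL;DR names four evaluation quantities: first-index accuracy, Maj@5 accuracy, Z-tests, and IoU. This excerpt does not give the paper's exact definitions, so the sketch below is only one plausible reading, under stated assumptions: "first-index" scores the first of several sampled answers, Maj@5 scores the majority vote over five samples, the Z-test is a pooled two-proportion test on accuracies, and IoU compares the sets of question IDs answered correctly before and after perturbation. All function names are hypothetical.

```python
from collections import Counter
from math import erf, sqrt


def first_index_correct(samples, gold):
    """Assumed reading of 'first-index': score only the first sampled answer."""
    return samples[0] == gold


def maj_at_k_correct(samples, gold, k=5):
    """Assumed reading of Maj@k: majority vote over the first k samples.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    votes = Counter(samples[:k])
    top_answer, _ = votes.most_common(1)[0]
    return top_answer == gold


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Standard normal CDF written with erf, so no third-party dependency.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


def iou(set_a, set_b):
    """Intersection over union of two sets (assumed: IDs answered correctly)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Made-up answers for one question, purely for illustration.
    samples = ["A", "B", "B", "B", "C"]
    print(first_index_correct(samples, "B"), maj_at_k_correct(samples, "B"))
    # Made-up counts: 80/100 correct on originals vs 60/100 on perturbed items.
    print(two_proportion_z(80, 100, 60, 100))
    print(iou({1, 2, 3}, {2, 3, 4}))
```

Under this reading, a high IoU alongside similar accuracies would suggest the model answers the same questions correctly regardless of the stated demographic, while a low IoU with similar accuracies would suggest the perturbation changes which questions it gets right.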

Paper Structure (30 sections, 2 figures, 21 tables)

Figures (2)

  • Figure 1: Flowchart of Data Perturbation using ChatGPT for Gender Modifications: Original prompt ($x$), Original generation ($g(x)$), Perturbed prompt ($p(x)$), Perturbed original generation ($p(g(x))$), Perturbation generation ($g(p(x))$).
  • Figure 2: Flowchart of Data Perturbation using ChatGPT for Ethnicity Modifications: Original prompt ($x$), Original generation ($g(x)$), Perturbed prompt 1 ($p_1(x)$), Perturbed prompt 2 ($p_2(x)$), Perturbed prompt 3 ($p_3(x)$), etc.; Perturbation generation 1 ($g(p_1(x))$), Perturbation generation 2 ($g(p_2(x))$), Perturbation generation 3 ($g(p_3(x))$), etc.