Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare

Pardis Sadat Zahraei, Zahra Shakeri

TL;DR

The authors develop EthiClinician, a fine-tuned model built on the ChatDoctor framework, which outperforms GPT-4 in both ethical reasoning and clinical judgment and sets a new benchmark for safer, more reliable patient outcomes.

Abstract

Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety, making the integrity of AI in healthcare more critical than ever. As Large Language Models (LLMs) take on a growing role in medical decision-making, addressing their biases and enhancing their accuracy is key to delivering safe, reliable care. This study addresses these challenges head-on by introducing new resources designed to promote ethical and precise AI in healthcare. We present two datasets: BiasMD, featuring 6,007 question-answer pairs crafted to evaluate and mitigate biases in health-related LLM outputs, and DiseaseMatcher, with 32,000 clinical question-answer pairs spanning 700 diseases, aimed at assessing symptom-based diagnostic accuracy. Using these datasets, we developed the EthiClinician, a fine-tuned model built on the ChatDoctor framework, which outperforms GPT-4 in both ethical reasoning and clinical judgment. By exposing and correcting hidden biases in existing models for healthcare, our work sets a new benchmark for safer, more reliable patient outcomes.

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: BiasMD Overview. Analysis of biases in health demographics across diverse populations, focusing on two candidate antecedents and the responses generated by LLMs.
  • Figure 2: Evaluation of LLMs on the BiasMD Dataset. The figure illustrates the accuracy of model responses across various demographic factors, including socioeconomics, sexuality, religion/belief, disability, and race/ethnicity. Accuracy here refers to the percentage of unbiased answers. EthiClinician achieved almost complete accuracy. GPT-4 followed with 90.1%, while Llama3-8B and Mixtral8x-7B scored 67.6% and 57.5%, respectively. GPT-3.5 Turbo achieved 23.91%, and Llama2-7B scored 1.1%. Medalpaca-7B and ChatDoctor both recorded 0% accuracy. These results underscore the ethical challenges faced by language models in the medical domain.
  • Figure 3: DiseaseMatcher Dataset Examples. Illustrative instances from the DiseaseMatcher dataset for determining disease likelihood in patients. Each instance presents two candidate patients with distinct symptom profiles and, occasionally, additional demographic attributes. The goal is to identify which patient is more likely to have a specific disease based on the provided information.
  • Figure 4: Evaluation of DiseaseMatcher on ChatDoctor and Llama2-7B Models. This figure presents examples of biased and harmful responses from the ChatDoctor model and instances where the Llama2-7B model refused to provide answers. The DiseaseMatcher dataset is designed to be bias-free, and refusal to answer a patient's query is considered unethical, as it constitutes discrimination and denial of service based on patient demographics unrelated to the patient's potential disease and symptoms. The figure highlights how irrelevant information regarding race or belief can significantly impact the decision-making abilities of both models.
  • Figure 5: Model Performance on the DiseaseMatcher Dataset. Accuracy of different models in determining the correct diagnosis across various demographic attributes: Belief, Race, Status, and None (NA, indicating no demographic attribute provided). Darker colors represent the first patient being correctly diagnosed, while lighter colors represent the second patient being correctly diagnosed. Correct diagnosis means the given disease matches the symptoms of that patient. The x-axis shows the number of symptoms provided for each patient, ranging from 3 to 6.
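
The per-group accuracy described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `accuracy_by_group`, the record layout (demographic attribute, number of symptoms, whether the model's answer was correct), and the sample records are all assumptions made for illustration only.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy (in percent) per (attribute, n_symptoms) group.

    `records` is an iterable of (attribute, n_symptoms, is_correct) tuples.
    This layout is a hypothetical one chosen for the sketch, not the
    DiseaseMatcher file format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for attribute, n_symptoms, is_correct in records:
        key = (attribute, n_symptoms)
        totals[key] += 1
        if is_correct:
            correct[key] += 1
    # Percentage of correct answers within each group.
    return {key: 100.0 * correct[key] / totals[key] for key in totals}


# Made-up records, purely to show the call; these are not results from the paper.
records = [
    ("Race", 3, True),
    ("Race", 3, False),
    ("Belief", 4, True),
    ("None", 6, True),
]
scores = accuracy_by_group(records)
```

Grouping by both the demographic attribute and the symptom count mirrors the two axes the caption describes; comparing a group such as `("Race", 3)` against `("None", 3)` would show whether an irrelevant attribute shifts accuracy.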