Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions

Man Luo, Christopher J. Warren, Lu Cheng, Haidar M. Abdul-Muhsin, Imon Banerjee

TL;DR

This work investigates the empathetic quality of large language model (LLM) responses in real-world healthcare interactions by comparing ChatGPT outputs to Mayo Clinic physicians using a de-identified prostate cancer dataset. It introduces LLaMA-EMRank, a non-finetuning-based empathy ranking framework that employs zero-shot, one-shot, few-shot, and ensemble in-context learning with LLaMA, alongside perplexity-based fluency measures and human patient evaluations. Across automatic metrics and human judgments, ChatGPT often appears more empathetic than physicians, though human assessments show substantial subjectivity. The study highlights the potential of LLM-powered chatbots to enhance patient support and reduce clinician burnout, while underscoring the need for robust, domain-aware empathy metrics and careful deployment considerations.
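The ranking idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the `Judge` interface, the majority-vote ensembling rule, and all function names are assumptions made for illustration. A real setup would wrap a LLaMA model behind the hypothetical `judge` callable. The perplexity helper uses the standard definition (exponential of the negative mean token log-probability), which may differ in detail from the paper's fluency measure.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a judge maps a prompt to the label "A" or "B".
# In practice this would wrap a call to an LLM such as LLaMA.
Judge = Callable[[str], str]
# One demonstration: (patient message, response A, response B, gold label).
Example = Tuple[str, str, str, str]


def build_prompt(message: str, resp_a: str, resp_b: str,
                 examples: Sequence[Example] = ()) -> str:
    """Build a pairwise empathy-ranking prompt.

    Zero demonstrations gives a zero-shot prompt, one gives one-shot,
    and several give few-shot.
    """
    parts = ["Decide which response to the patient is more empathetic. "
             "Answer A or B."]
    for ex_msg, ex_a, ex_b, ex_label in examples:
        parts.append(f"Patient: {ex_msg}\nResponse A: {ex_a}\n"
                     f"Response B: {ex_b}\nAnswer: {ex_label}")
    parts.append(f"Patient: {message}\nResponse A: {resp_a}\n"
                 f"Response B: {resp_b}\nAnswer:")
    return "\n\n".join(parts)


def ensemble_rank(judge: Judge, message: str, resp_a: str, resp_b: str,
                  example_sets: Sequence[Sequence[Example]]) -> str:
    """Query the judge once per demonstration set and take a majority vote."""
    votes = Counter(
        judge(build_prompt(message, resp_a, resp_b, examples))
        for examples in example_sets
    )
    return votes.most_common(1)[0][0]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

With two labels, passing an odd number of demonstration sets to `ensemble_rank` avoids tied votes. Lower perplexity indicates text the scoring model finds more predictable, which is commonly read as a proxy for fluency.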

Abstract

The integration of Large Language Models (LLMs) into the healthcare domain has the potential to significantly enhance patient care and support through the development of empathetic, patient-facing chatbots. This study investigates an intriguing question: can ChatGPT respond with a greater degree of empathy than physicians typically offer? To answer this question, we collect a de-identified dataset of patient messages and physician responses from Mayo Clinic and generate alternative replies using ChatGPT. Our analyses incorporate a novel empathy ranking evaluation (EMRank) involving both automated metrics and human assessments to gauge the empathy level of responses. Our findings indicate that LLM-powered chatbots have the potential to surpass human physicians in delivering empathetic communication, suggesting a promising avenue for enhancing patient care and reducing professional burnout. The study not only highlights the importance of empathy in patient interactions but also proposes a set of effective automatic empathy ranking metrics, paving the way for the broader adoption of LLMs in healthcare.

Paper Structure (31 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Given a patient message, we prompt ChatGPT for a response. We restrict the length of the ChatGPT response to mimic the statistics of the physician's response (see the dataset statistics table). Both ChatGPT's and the physician's responses are then evaluated using a multi-dimension LLM-EMRank metric for automatic ranking (LLaMA Empathy Evaluation). In addition, we conduct a human empathy evaluation to ensure a thorough and rigorous assessment.
  • Figure 2: In-context learning example: the patient is given two responses, assesses which one is more empathetic, and provides a justification. Note that patients do not know which response is from ChatGPT or the physician when evaluating empathy.
  • Figure 3: Pearson correlation values between the automatic metrics and human judgment.
  • Figure 4: Given the responses from ChatGPT and the physician, both the human evaluator and all automatic metrics rate ChatGPT's response as more empathetic. Note that patients do not know which response is from ChatGPT or the physician.
  • Figure 5: Given the responses from ChatGPT and the physician, the human evaluator rates the physician's response as more empathetic, while all automatic metrics rate ChatGPT's response as more empathetic. Note that patients do not know which response is from ChatGPT or the physician.
  • ...and 2 more figures