Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions
Man Luo, Christopher J. Warren, Lu Cheng, Haidar M. Abdul-Muhsin, Imon Banerjee
TL;DR
This work investigates the empathetic quality of large language model (LLM) responses in real-world healthcare interactions by comparing ChatGPT outputs with responses from Mayo Clinic physicians on a de-identified prostate cancer dataset. It introduces LLaMA-EMRank, an empathy ranking framework that requires no fine-tuning and instead uses zero-shot, one-shot, few-shot, and ensemble in-context learning with LLaMA, alongside perplexity-based fluency measures and human patient evaluations. Across automatic metrics and human judgments, ChatGPT often appears more empathetic than physicians, though human assessments show substantial subjectivity. The study highlights the potential of LLM-powered chatbots to enhance patient support and reduce clinician burnout, while underscoring the need for robust, domain-aware empathy metrics and careful deployment.
Abstract
The integration of Large Language Models (LLMs) into the healthcare domain has the potential to significantly enhance patient care and support through the development of empathetic, patient-facing chatbots. This study investigates an intriguing question: can ChatGPT respond with a greater degree of empathy than physicians typically do? To answer this question, we collect a de-identified dataset of patient messages and physician responses from Mayo Clinic and generate alternative replies using ChatGPT. Our analyses incorporate a novel empathy ranking evaluation (EMRank) that involves both automated metrics and human assessments to gauge the empathy level of responses. Our findings indicate that LLM-powered chatbots have the potential to surpass human physicians in delivering empathetic communication, suggesting a promising avenue for enhancing patient care and reducing professional burnout. The study not only highlights the importance of empathy in patient interactions but also proposes a set of effective automatic empathy ranking metrics, paving the way for the broader adoption of LLMs in healthcare.
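
To make the perplexity-based fluency measure and the ensemble step mentioned in the TL;DR more concrete, the following is a minimal Python sketch under stated assumptions. It is not the authors' implementation: the function names `perplexity` and `ensemble_vote`, the label strings, and the majority-vote aggregation rule are all hypothetical choices made for illustration. The sketch assumes that per-token natural-log probabilities have already been obtained from a language model, and that each in-context setting (zero-shot, one-shot, few-shot) has already produced a label naming the response it judged more empathetic.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """Perplexity as exp of the negative mean natural-log probability per token.

    `token_logprobs` is assumed to come from some language model; how the
    paper obtains these values is not specified here.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ensemble_vote(votes):
    """Majority vote over per-setting preferences (an assumed aggregation rule).

    Returns the label with the most votes, or "tie" when no label strictly leads.
    """
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Hypothetical inputs: four tokens, each assigned probability 0.5 by the model.
fluency = perplexity([math.log(0.5)] * 4)

# Hypothetical preferences from zero-shot, one-shot, and few-shot prompts.
winner = ensemble_vote(["chatgpt", "physician", "chatgpt"])
```

Lower perplexity indicates text the model finds more predictable, which is why it is commonly used as a fluency proxy. The vote function only shows one simple way to combine several prompt settings into a single ranking decision.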
