Are Large Language Models More Empathetic than Humans?
Anuradha Welivita, Pearl Pu
TL;DR
This study evaluates whether contemporary LLMs can surpass humans in empathetic responding by conducting a large-scale, between-subjects user study (N = 1,000) with 2,000 prompts drawn from the EmpatheticDialogues dataset across 32 emotions. Four state-of-the-art LLMs (GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, Mixtral-8x7B-Instruct) are compared against human responses, using a 3-point empathy rating scale and chi-square analysis to assess differences in Bad/Okay/Good ratings. Results show LLMs outperform humans on Good ratings overall, with GPT-4 yielding the largest gains (≈31%) and robust gains across many positive emotions; gains persist across both positive and negative emotions, though magnitudes vary by model and emotion. The authors present a scalable evaluation framework for future LLM empathy assessment, discuss ethical considerations, and provide a data-release plan to support reproducibility and ongoing benchmarking. The work suggests substantial potential for empathetic AI in domains like customer service and mental health support, while underscoring the need for caution and continual bias monitoring in sensitive contexts.
Abstract
With the emergence of large language models (LLMs), investigating whether they can surpass humans in areas such as emotion recognition and empathetic responding has become a focal point of research. This paper presents a comprehensive study exploring the empathetic responding capabilities of four state-of-the-art LLMs (GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8x7B-Instruct) in comparison to a human baseline. We engaged 1,000 participants in a between-subjects user study, assessing the empathetic quality of responses generated by humans and the four LLMs to 2,000 emotional dialogue prompts meticulously selected to cover a broad spectrum of 32 distinct positive and negative emotions. Our findings reveal a statistically significant superiority of the empathetic responding capability of LLMs over humans. GPT-4 emerged as the most empathetic, with an approximately 31% increase in responses rated as "Good" compared to the human benchmark. It was followed by LLaMA-2, Mixtral-8x7B, and Gemini-Pro, which showed increases of approximately 24%, 21%, and 10% in "Good" ratings, respectively. We further analyzed the response ratings at a finer granularity and discovered that some LLMs are significantly better at responding to specific emotions compared to others. The suggested evaluation framework offers a scalable and adaptable approach for assessing the empathy of new LLMs, avoiding the need to replicate this study's findings in future research.
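As a rough illustration of the kind of chi-square comparison mentioned in the TL;DR, the sketch below runs a Pearson chi-square test of independence on a 2x3 table of Bad/Okay/Good counts for two responders. It is a minimal sketch under stated assumptions, not the authors' analysis code: the function name `chi_square_2x3` and the counts are hypothetical and are not taken from the paper, and the exact test configuration used in the study is not given in this summary. With 2 degrees of freedom, the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def chi_square_2x3(row_a, row_b):
    """Pearson chi-square test of independence for a 2x3 contingency table.

    Each row holds the counts of (Bad, Okay, Good) ratings for one responder.
    Returns (statistic, p_value). Assumes every rating column has a nonzero
    total, otherwise an expected count would be zero.
    """
    if len(row_a) != 3 or len(row_b) != 3:
        raise ValueError("each row must hold exactly three counts")

    table = [list(row_a), list(row_b)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(3)]
    n = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(3):
            # Expected count under independence of responder and rating.
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected

    # df = (2 - 1) * (3 - 1) = 2, for which P(X > x) = exp(-x / 2) exactly.
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical (Bad, Okay, Good) counts for illustration only;
# these are NOT results from the paper.
human = (100, 200, 100)
model = (50, 150, 200)

stat, p = chi_square_2x3(human, model)
print(f"chi2 = {stat:.2f}, p = {p:.3g}")
```

A small p-value here would indicate that the distribution of ratings differs between the two responders; it does not by itself say which rating category drives the difference, so a per-category follow-up would still be needed.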
