Are Large Language Models More Empathetic than Humans?

Anuradha Welivita; Pearl Pu

TL;DR

This study evaluates whether contemporary LLMs can surpass humans in empathetic responding through a large-scale, between-subjects user study (N = 1,000) with 2,000 prompts drawn from the EmpatheticDialogues dataset across 32 emotions. Four state-of-the-art LLMs (GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, Mixtral-8x7B-Instruct) are compared against human responses, using a 3-point empathy rating scale (Bad/Okay/Good) and chi-square analysis to test for differences in the rating distributions. Results show that LLMs outperform humans on Good ratings overall, with GPT-4 yielding the largest gain (≈31%). Gains persist across both positive and negative emotions, though their magnitudes vary by model and emotion. The authors present a scalable evaluation framework for future LLM empathy assessment, discuss ethical considerations, and provide a data-release plan to support reproducibility and ongoing benchmarking. The work suggests substantial potential for empathetic AI in domains such as customer service and mental-health support, while underscoring the need for caution and continual bias monitoring in sensitive contexts.

Abstract

With the emergence of large language models (LLMs), investigating whether they can surpass humans in areas such as emotion recognition and empathetic responding has become a focal point of research. This paper presents a comprehensive study exploring the empathetic responding capabilities of four state-of-the-art LLMs (GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8x7B-Instruct) in comparison to a human baseline. We engaged 1,000 participants in a between-subjects user study, assessing the empathetic quality of responses generated by humans and the four LLMs to 2,000 emotional dialogue prompts meticulously selected to cover a broad spectrum of 32 distinct positive and negative emotions. Our findings reveal a statistically significant superiority of the empathetic responding capability of LLMs over humans. GPT-4 emerged as the most empathetic, with an approximately 31% increase in responses rated as "Good" compared to the human benchmark. It was followed by LLaMA-2, Mixtral-8x7B, and Gemini-Pro, which showed increases of approximately 24%, 21%, and 10% in "Good" ratings, respectively. We further analyzed the response ratings at a finer granularity and discovered that some LLMs are significantly better at responding to specific emotions than others. The suggested evaluation framework offers a scalable and adaptable approach for assessing the empathy of new LLMs, avoiding the need to replicate this study's findings in future research.
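
To make the chi-square comparison of Bad/Okay/Good ratings concrete, here is a minimal sketch in plain Python. It is not the authors' analysis code. The counts are made up for illustration and are not the paper's data, the helper names are hypothetical, and reading the "Good" gain as a relative increase over the human count is an assumption about how such a percentage might be computed.

```python
import math


def chi_square_independence(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of source and rating.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def p_value_dof2(stat):
    """Upper-tail p-value, valid only for 2 degrees of freedom, where it equals exp(-x/2)."""
    return math.exp(-stat / 2.0)


# Hypothetical counts, NOT taken from the paper. Columns: Bad, Okay, Good.
ratings = [
    [80, 200, 120],  # human responses (made-up)
    [40, 180, 180],  # one LLM's responses (made-up)
]

stat, dof = chi_square_independence(ratings)
p = p_value_dof2(stat)  # a 2x3 table has (2-1)*(3-1) = 2 degrees of freedom

# Assumed definition: relative increase in "Good" counts over the human baseline.
relative_gain_good = (ratings[1][2] - ratings[0][2]) / ratings[0][2]

print(f"chi2={stat:.3f}, dof={dof}, p={p:.2e}, Good gain={relative_gain_good:.0%}")
```

The closed-form p-value only holds for a two-row, three-column table like this one; for other table shapes a statistics library would be the safer choice.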

Paper Structure (26 sections, 13 figures, 10 tables)

Introduction
Literature Review
The Dataset
Experiment Design
Between-Subjects vs Within-Subjects
Selection of the Rating Scale
Task Design
Quality Control
Statistical Test and Sample Size
Results
Case Study
Discussion
Limitations
Ethical Considerations
Distribution of Emotions
...and 11 more sections

Figures (13)

Figure 1: Between-subjects experiment design to evaluate the level of empathy demonstrated by LLMs compared to a human baseline when responding to emotional situations.
Figure 2: The Good, Okay, and Bad rating counts corresponding to the responses generated by humans, GPT-4, LLaMA-2, Gemini-Pro, and Mixtral-8x7B. The percentage gains of the LLMs' response ratings compared to the humans' response ratings are indicated at the top of each bar. The gains indicated in red are statistically significant.
Figure 3: The Good, Okay, and Bad rating counts corresponding to the responses generated by humans, GPT-4, LLaMA-2, Gemini-Pro, and Mixtral-8x7B for positive and negative emotional dialogue prompts.
Figure 4: Distribution of the dialogue prompt-response pairs sampled from the EmpatheticDialogues dataset across the 32 positive and negative emotions.
Figure 5: The description of the task.
...and 8 more figures
