Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Muhammad Arslan Manzoor, Yuxia Wang, Minghan Wang, Preslav Nakov

TL;DR

A systematic exploration of LMs' understanding of empathy reveals substantial opportunities for further investigation in both task formulation and modeling, and finds that subjectivity in interpreting empathy among annotators appears to be independent of cultural background.

Abstract

Empathy plays a pivotal role in fostering prosocial behavior, often triggered by the sharing of personal experiences through narratives. However, modeling empathy using NLP approaches remains challenging due to its deep interconnection with human interaction dynamics. Previous approaches, which involve fine-tuning language models (LMs) on human-annotated empathic datasets, have had limited success. In our pursuit of improving empathy understanding in LMs, we propose several strategies, including contrastive learning with masked LMs and supervised fine-tuning with large language models. While these methods show improvements over previous methods, the overall results remain unsatisfactory. To better understand this trend, we performed an analysis that reveals low agreement among annotators. This lack of consensus hinders training and highlights the subjective nature of the task. We also explore the cultural impact on annotations. To study this, we meticulously collected story pairs in the Urdu language and found that subjectivity in interpreting empathy among annotators appears to be independent of cultural background. Our systematic exploration of LMs' understanding of empathy reveals substantial opportunities for further investigation in both task formulation and modeling.

Paper Structure (44 sections, 9 figures, 12 tables)

This paper contains 44 sections, 9 figures, 12 tables.

Figures (9)

  • Figure 1: An ideal interaction between users and a system. A chatbot can resonate with a human, and a search engine can retrieve stories of similar experience.
  • Figure 2: Pearson and Spearman correlations between overall empathic similarity and event, emotion, and moral similarity. Moral similarity has the highest correlation with empathic similarity, followed by event and emotion.
  • Figure 3: Dev/Test set empathic similarity distribution: predictions of BART vs. SBERT vs. ground truth.
  • Figure 4: Gold-label Guided Explanation Generation Prompt using Llama3-70B-instruct.
  • Figure 5: Confusion matrix of fine-tuned Llama-3-8B on the training set. The model could not estimate the score conditioned on the input story pair; instead, it sampled a similarity class from the gold class distribution of the training data, $P(Y)$ = (0.140, 0.399, 0.404, 0.057), regardless of the input pair, which makes its predictions effectively random on both seen and unseen cases. It learned nothing beyond the label statistics $P(Y)$ of the training set. See Section \ref{sec:understandbottleneck} for more.
  • ...and 4 more figures
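
The Figure 5 caption describes a model that draws its prediction from $P(Y)$ no matter what the input is. As a rough illustration of what that implies (not the authors' evaluation code), the sketch below computes the accuracy such an input-blind sampler would reach if the gold labels follow the same $P(Y)$. The function names and the class indices 0 to 3 are hypothetical placeholders; only the four probabilities come from the caption.

```python
import random

# Gold class distribution P(Y) quoted in the Figure 5 caption.
P_Y = [0.140, 0.399, 0.404, 0.057]


def expected_accuracy(prior):
    """Accuracy of a predictor that samples its label from `prior`,
    ignoring the input, when gold labels follow the same `prior`."""
    return sum(p * p for p in prior)


def majority_accuracy(prior):
    """Accuracy of always predicting the most frequent class."""
    return max(prior)


def simulate(prior, n=100_000, seed=0):
    """Monte Carlo check: draw gold and predicted labels independently."""
    rng = random.Random(seed)
    classes = list(range(len(prior)))  # hypothetical class indices 0..3
    gold = rng.choices(classes, weights=prior, k=n)
    pred = rng.choices(classes, weights=prior, k=n)
    return sum(g == p for g, p in zip(gold, pred)) / n


if __name__ == "__main__":
    print(f"analytic  : {expected_accuracy(P_Y):.4f}")
    print(f"simulated : {simulate(P_Y):.4f}")
    print(f"majority  : {majority_accuracy(P_Y):.4f}")
```

Under these assumptions the sampler's expected accuracy is the sum of squared class probabilities, which is lower than always predicting the most frequent class. This is one way to read the caption's claim that the model learned only the label statistics.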