Multi-dimensional Evaluation of Empathetic Dialog Responses

Zhichao Xu; Jiepu Jiang

TL;DR

The paper addresses the challenge of evaluating empathy in dialogue by proposing a two-dimensional framework that captures both expressed empathy (speaker intents) and perceived empathy (listener-perceived aspects: Engagement, Understanding, Sympathy, Helpfulness). It validates this framework on an internal customer-service corpus and on public empathy datasets, showing that perceived empathy is strongly related to conversation satisfaction. Through extensive experiments, it compares prompting LLMs with various language-model classifiers and finds that instruction-finetuned Flan-T5 models outperform both prompting methods and prior baselines. The work demonstrates the value of context, natural-language instructions, and tailored loss functions for accurate, scalable empathy measurement, with implications for evaluating and improving empathetic dialogue in real-world applications. It also identifies limits of current LLM prompting and outlines pathways for applying the framework to human-machine interactions in the future.

Abstract

Empathy is critical for effective and satisfactory conversational communication. Prior efforts to measure conversational empathy mostly focus on expressed communicative intents, that is, the way empathy is expressed. Yet these works ignore the fact that conversation is also a collaboration involving both speakers and listeners. In contrast, we propose a multi-dimensional empathy evaluation framework that measures both expressed intents from the speaker's perspective and perceived empathy from the listener's perspective. We apply our analytical framework to examine internal customer-service dialogues. We find that the two dimensions (expressed intent types and perceived empathy) are interconnected, and that perceived empathy correlates highly with dialogue satisfaction levels. To reduce annotation cost, we explore two options for automatically measuring conversational empathy: prompting LLMs and training language-model-based classifiers. Our experiments show that prompting methods, even with popular models such as GPT-4 and Flan family models, perform relatively poorly on both public and our internal datasets. In contrast, instruction-finetuned classifiers based on Flan-T5 family models outperform prior works and competitive baselines. We conduct a detailed ablation study to give more insight into the strong performance of the instruction-finetuning method.
