ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

Haiquan Zhao; Lingyu Li; Shisong Chen; Shuqi Kong; Jiaan Wang; Kexin Huang; Tianle Gu; Yixu Wang; Wang Jian; Dandan Liang; Zhixu Li; Yan Teng; Yanghua Xiao; Yingchun Wang

TL;DR

This work re-organizes role-playing cards from seven existing datasets and trains ESC-Role, a dedicated role-playing model that behaves more like a confused person than GPT-4 does. It also develops ESC-RANK, an automatic scorer trained on the annotated data that surpasses GPT-4 by 35 points in scoring performance.

Abstract

Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC models remains unclear. Inspired by the rapid development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role, which behaves more like a confused person than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as the ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We conduct comprehensive human annotations on the interactive multi-turn dialogues of the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but still fall short of human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which is trained on the annotated data and surpasses GPT-4 by 35 points in scoring performance. Our data and code are available at https://github.com/AIFlames/Esc-Eval.
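The interaction step described in the abstract (a role-playing agent conditioned on a role card converses with an ESC model, and the resulting dialogue is then scored) can be illustrated with a minimal sketch. All names here (`RoleCard`, `simulate_dialogue`, `stub_seeker`, `stub_supporter`) are hypothetical and are not taken from the ESC-Eval code base; the two stub functions merely stand in for ESC-Role and for an ESC model under test.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RoleCard:
    """Hypothetical role card: a short description of the help-seeker."""
    description: str


Turn = Tuple[str, str]  # (speaker, utterance)


def simulate_dialogue(
    card: RoleCard,
    seeker: Callable[[RoleCard, List[Turn]], str],
    supporter: Callable[[List[Turn]], str],
    max_turns: int = 3,
) -> List[Turn]:
    """Alternate seeker and supporter turns and return the full dialogue."""
    history: List[Turn] = []
    for _ in range(max_turns):
        # The role-playing agent speaks first, conditioned on its role card.
        history.append(("seeker", seeker(card, history)))
        # The ESC model under evaluation replies to the dialogue so far.
        history.append(("supporter", supporter(history)))
    return history


def stub_seeker(card: RoleCard, history: List[Turn]) -> str:
    """Placeholder for a role-playing model such as ESC-Role (assumption)."""
    return f"[seeker turn {len(history) // 2 + 1}] {card.description}"


def stub_supporter(history: List[Turn]) -> str:
    """Placeholder for the ESC model being evaluated (assumption)."""
    return f"[supporter reply to] {history[-1][1]}"


if __name__ == "__main__":
    card = RoleCard("I failed an exam and feel lost.")
    for speaker, text in simulate_dialogue(card, stub_seeker, stub_supporter, 2):
        print(f"{speaker}: {text}")
```

In the framework as described, the collected dialogues would then be passed to human annotators, or to an automatic scorer such as ESC-RANK; that scoring step is deliberately left out of this sketch.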

Paper Structure (38 sections, 19 figures, 16 tables)

Introduction
ESC-Eval
Framework Overview
Role Card Acquisition
Dataset collection
User cards extraction and filtering
Manual annotation and correction
ESC-Role
Data Collection
Implementation and Evaluation Metric
Evaluation Results
Evaluation
Evaluating models
Evaluation Results
Correlation Analysis
...and 23 more sections

Figures (19)

Figure 1: Difference between our proposed evaluation framework and others.
Figure 2: Overview of ESC-Eval, which uses role-playing to evaluate the capability of ESC models.
Figure 3: Win rate of different role-playing agents and source data, where source denotes human dialogue.
Figure 4: The framework of user-card construction. First, initial user cards are extracted from open-source datasets using GPT-4. Second, based on the scene classification we designed, GPT-4 determines the category each user card belongs to, and the cards are filtered further. Third, we employ crowdsourcing to annotate the scene categories and subcategories, and manually filter the user cards again.
Figure 5: Role cards distribution of our constructed benchmark.
...and 14 more figures
