CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Quan Tu; Shilong Fan; Zihang Tian; Rui Yan

TL;DR

CharacterEval is a Chinese benchmark for Role-Playing Conversational Agents (RPCAs). Its multi-turn dialogue dataset (1,785 conversations, 23,020 examples, 77 characters) was extracted with GPT-4, refined by human annotators, and supplemented with character profiles from Baidu Baike. The benchmark defines thirteen metrics across four dimensions and introduces a character-based reward model (CharacterRM) so that automated scores track subjective human judgments. Experiments on ten LLMs show that Chinese LLMs can outperform GPT-4 in Chinese role-playing conversation, with specialized role-playing models delivering the strongest performance in most dimensions. The data, evaluation framework, and baselines are intended as an open resource for reproducible RPCA benchmarking.

Abstract

Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.
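As a rough illustration of how thirteen metrics might roll up into four dimensions, the sketch below groups per-metric scores by dimension and averages them. The dimension names follow the paper's section outline; the metric names (`metric_a`, ...) are hypothetical placeholders, and simple averaging is an assumption made here for illustration, not necessarily the aggregation the authors use.

```python
from statistics import mean

# Dimension names follow the paper's section outline. The metric names are
# hypothetical placeholders, NOT the paper's actual metric names.
DIMENSIONS = {
    "conversational_ability": ["metric_a", "metric_b"],
    "character_consistency": ["metric_c", "metric_d"],
    "role_playing_attractiveness": ["metric_e"],
    "personality_back_testing": ["metric_f"],
}


def dimension_scores(metric_scores: dict[str, float]) -> dict[str, float]:
    """Average the available per-metric scores within each dimension.

    Averaging is an assumption for this sketch; a dimension with no
    scored metrics is omitted from the result rather than scored as 0.
    """
    result = {}
    for dimension, metrics in DIMENSIONS.items():
        values = [metric_scores[m] for m in metrics if m in metric_scores]
        if values:  # skip dimensions the model could not be scored on
            result[dimension] = mean(values)
    return result
```

Omitting a dimension that has no scores, instead of reporting zero, keeps a model that cannot complete one part of the evaluation from being penalized as if it had scored worst on it.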

Paper Structure (21 sections, 1 equation, 4 figures, 5 tables)

Introduction
Related Work
Knowledge-based Dialogue
Personalized Dialogue
Character-based Dialogue
Problem Formulation
Data Collection
Evaluation Metric
Conversational Ability
Character Consistency
Role-playing Attractiveness
Personality Back-Testing
Experiment
Dataset Statistic
Experimental Setting
...and 6 more sections

Figures (4)

Figure 1: An example from CharacterEval, including the dialogue, the scene, and the character's profile.
Figure 2: Evaluation system of CharacterEval. "Know-" is the abbreviation of "Knowledge".
Figure 3: A comprehensive comparison of LLMs on the four dimensions. Since CharacterGLM cannot successfully complete personality back-testing, we mark its result with an 'X' instead.
Figure 4: Model performance across different stages of the conversation.