CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Quan Tu, Shilong Fan, Zihang Tian, Rui Yan

TL;DR

CharacterEval provides a Chinese RPCA benchmark with a high-quality, multi-turn dialogue dataset (77 characters across 1,785 conversations and ~11k–11.4k examples) extracted via GPT-4 and refined by humans, complemented by Baidu Baike profiles. It introduces a 13-metric, four-dimension evaluation framework and a character-based reward model (CharacterRM) to align subjective judgments with automated scoring. Experiments across ten LLMs reveal that Chinese LLMs can outperform GPT-4 in Chinese RPCA, with specialized role-playing models delivering the strongest performance in most dimensions. The work offers a rigorous, open resource for RPCA research, including data, evaluation frameworks, and baselines for reproducible benchmarking.

Abstract

Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.

Paper Structure (21 sections, 1 equation, 4 figures, 5 tables)

Figures (4)

  • Figure 1: An example from CharacterEval, including the dialogue, scene, and character profile.
  • Figure 2: Evaluation system of CharacterEval. "Know-" is the abbreviation of "Knowledge".
  • Figure 3: Comprehensive comparison of LLMs on the four dimensions. Since CharacterGLM cannot successfully complete personality back-testing, we mark its result with 'X' instead.
  • Figure 4: Model performance across the different stages of the conversation.