CharacterBench: Benchmarking Character Customization of Large Language Models

Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang

TL;DR

CharacterBench addresses the shortcomings of existing character customization benchmarks by introducing the largest bilingual generative benchmark (22,859 human-annotated samples across 3,956 characters and 25 character categories), which decomposes character customization into 11 evaluation dimensions over 6 aspects (memory, knowledge, persona, emotion, morality, believability). It achieves robust, efficient evaluation through dimension-specific queries (targeted queries for sparse dimensions and natural prompts for dense ones) and a specialized CharacterJudge model that correlates better with human judgments than SOTA automatic judges such as GPT-4. The benchmark integrates diverse data sources (human role-play, human-prototype interactions, and literary extractions) and applies rigorous quality control and translation, yielding a DPO-ready benchmark. Empirical results show strong alignment with human evaluation, competitive performance by open-source LLMs, and clear potential for optimizing LLMs’ character customization. Ethical and methodological considerations are addressed through careful data governance, translation fidelity, and usage restricted to research purposes.

Abstract

Character-based dialogue (a.k.a. role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs' character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response. We enable effective and efficient evaluation by crafting tailored queries for each dimension to induce characters' responses related to specific dimensions. Further, we develop the CharacterJudge model for cost-effective and stable evaluations. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark's potential to optimize LLMs' character customization. Our repository is at https://github.com/thu-coai/CharacterBench.

Paper Structure

This paper contains 45 sections, 3 equations, 3 figures, 60 tables.

Figures (3)

  • Figure 1: Evaluation framework of our CharacterBench and an illustration of how it checks boundary consistency. Dense and sparse dimensions are classified by whether the character features evaluated by specific dimensions always manifest in each response. We enable effective and efficient evaluation by crafting tailored queries for each dimension.
  • Figure 2: Construction pipeline of our CharacterBench, which is explained in more detail in the "Overview" subsection.
  • Figure 3: Category distributions of characters in CharacterBench, with 4 main categories and 25 sub-categories.