CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds

Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing Xie, Ji-Rong Wen

TL;DR

CharacterBox addresses the challenge of evaluating LLM-based role-playing by creating a dynamic, text-based multi-agent virtual world. It formalizes a scene as $S = \{E, C\}$ and a trajectory as $\tau = \{E, c, \{o_1, a_1\}, \ldots, \{o_n, a_n\}\}$ to capture evolving interactions between the environment and the characters. The framework integrates a character agent and a narrator world model, plus two trajectory-based fine-tuning methods (Guided Trajectory Fine-tuning and Reflective Trajectory Fine-tuning), together with the cost-efficient components CharacterNR and CharacterRM, which reduce API dependence. The reported results indicate reliable, valid assessments across languages; trajectory-based fine-tuning lets smaller models reach competitive performance, and the framework supports scalable, self-contained evaluation across diverse scenes.
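
The scene and trajectory notation can be mirrored in a small data structure. The following is a minimal sketch, not the authors' implementation: it assumes that E is an environment description, C is the set of characters, c is one character, and each pair of o_i and a_i is an observation followed by an action. The class names `Scene` and `Trajectory`, the `record` helper, and the example strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Scene:
    """S = {E, C}: an environment description E plus the characters C (assumed reading)."""
    environment: str
    characters: List[str]


@dataclass
class Trajectory:
    """tau = {E, c, {o_1, a_1}, ..., {o_n, a_n}} for a single character c."""
    environment: str
    character: str
    # Each step pairs what the character observed with what it then did (assumed reading).
    steps: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, observation: str, action: str) -> None:
        """Append one (o_i, a_i) pair to the trajectory."""
        self.steps.append((observation, action))


# Illustrative data only; these strings are not taken from the paper.
scene = Scene(environment="A rainy harbor at night", characters=["Captain", "Stowaway"])
trajectory = Trajectory(environment=scene.environment, character=scene.characters[0])
trajectory.record("A figure steps out from behind a crate.", "Raise the lantern and call out.")
```

Keeping one trajectory per character, rather than one shared log, matches the single lowercase c in the formula; whether the original system stores it this way is not stated here.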

Abstract

Role-playing is a crucial capability of Large Language Models (LLMs), enabling a wide range of practical applications, including intelligent non-player characters, digital twins, and emotional companions. Evaluating this capability in LLMs is challenging due to the complex dynamics involved in role-playing, such as maintaining character fidelity throughout a storyline and navigating open-ended narratives without a definitive ground truth. Current evaluation methods, which primarily focus on question-answering or conversational snapshots, fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing. In this paper, we propose CharacterBox, which is a simulation sandbox designed to generate situational fine-grained character behavior trajectories. These behavior trajectories enable a more comprehensive and in-depth evaluation of role-playing capabilities. CharacterBox consists of two main components: the character agent and the narrator agent. The character agent, grounded in psychological and behavioral science, exhibits human-like behaviors, while the narrator agent coordinates interactions between character agents and environmental changes. Additionally, we introduce two trajectory-based methods that leverage CharacterBox to enhance LLM performance. To reduce costs and facilitate the adoption of CharacterBox by public communities, we fine-tune two smaller models, CharacterNR and CharacterRM, as substitutes for GPT API calls, and demonstrate their competitive performance compared to advanced GPT APIs.
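
The division of labor between the two components can be illustrated with a toy loop. This is a sketch under stated assumptions, not the algorithm from the paper: the function `simulate`, its parameters, and the two stub agents are hypothetical, plain Python functions stand in for LLM calls, and every character is assumed to observe the full environment text.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: plain functions stand in for LLM-backed agents.
CharacterFn = Callable[[str, str], str]            # (character, observation) -> action
NarratorFn = Callable[[str, Dict[str, str]], str]  # (environment, actions) -> updated environment


def simulate(
    environment: str,
    characters: List[str],
    act: CharacterFn,
    narrate: NarratorFn,
    rounds: int,
) -> Dict[str, List[Tuple[str, str]]]:
    """Run a fixed number of rounds and collect (observation, action) pairs per character."""
    trajectories: Dict[str, List[Tuple[str, str]]] = {name: [] for name in characters}
    for _ in range(rounds):
        actions: Dict[str, str] = {}
        for name in characters:
            observation = environment  # simplification: each character sees the whole environment
            action = act(name, observation)
            trajectories[name].append((observation, action))
            actions[name] = action
        # The narrator folds all actions from this round into the next environment state.
        environment = narrate(environment, actions)
    return trajectories


def stub_act(name: str, observation: str) -> str:
    """Stand-in for a character agent; a real one would query an LLM."""
    return f"{name} looks around"


def stub_narrate(environment: str, actions: Dict[str, str]) -> str:
    """Stand-in for a narrator agent; it simply appends the actions to the scene text."""
    return environment + " " + " ".join(f"[{a}]" for a in actions.values())


result = simulate("An empty tavern.", ["Ann", "Bo"], stub_act, stub_narrate, rounds=2)
```

In this sketch the narrator is the only component that changes the environment, which reflects the description of it as the coordinator of character interactions and environmental changes.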

Paper Structure

This paper contains 29 sections, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: A comparison of different role-playing facilities: (A) self-reported QA; (B) conversations; and (C) CharacterBox. Unlike the other methods, CharacterBox not only prompts role agents for utterances and actions but also includes components that track environmental changes and coordinate interactions between role agents.
  • Figure 2: Performance comparison under Guided and Reflective Trajectory Fine-tuning across English and Chinese scenes.
  • Figure 3: Comparison between GPT-3.5, CharacterNR, and the base model Qwen2.5-7B.
  • Figure 4: Distribution of the number of characters across 100 scenes.
  • Figure 5: A case study demonstrating that CharacterBox can be extended to scenario simulations with average characters in diverse contexts.