CharacterBox: Evaluating the Role-Playing Capabilities of LLMs in Text-Based Virtual Worlds
Lei Wang, Jianxun Lian, Yi Huang, Yanqi Dai, Haoxuan Li, Xu Chen, Xing Xie, Ji-Rong Wen
TL;DR
CharacterBox addresses the challenge of evaluating LLM-based role-playing by creating a dynamic, text-based multi-agent virtual world. It formalizes a scene as $S = \{E, C\}$ and a trajectory as $\tau = \{E, c, \{o_1, a_1\}, \ldots, \{o_n, a_n\}\}$ to capture evolving interactions between the environment and the characters. The framework integrates a character agent and a narrator world model, and adds two trajectory-based fine-tuning methods, Guided Trajectory Fine-tuning and Reflective Trajectory Fine-tuning, together with two cost-efficient components, CharacterNR and CharacterRM, that reduce API dependence. Empirical results indicate that the assessments are reliable and valid across languages, that trajectory-based fine-tuning lets smaller models reach competitive performance, and that the framework supports scalable, self-contained evaluation over diverse scenes.
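To make the notation concrete, the scene and trajectory definitions can be sketched as plain data structures. This is a minimal illustration, not the paper's implementation: the class and field names are hypothetical, and reading $o_i$ as an observation and $a_i$ as an action is an assumption based on the usual agent notation.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    """A scene S = {E, C}: an environment description plus its characters."""
    environment: str        # E (assumed here to be free text)
    characters: list[str]   # C (assumed here to be character names)


@dataclass
class Trajectory:
    """tau = {E, c, {o_1, a_1}, ..., {o_n, a_n}} for one character c in C."""
    environment: str        # E
    character: str          # c
    # Each step pairs o_i with a_i; "observation" and "action" are assumed meanings.
    steps: list[tuple[str, str]] = field(default_factory=list)

    def record(self, observation: str, action: str) -> None:
        """Append one (o_i, a_i) pair to the trajectory."""
        self.steps.append((observation, action))


def start_trajectories(scene: Scene) -> dict[str, Trajectory]:
    """Create one empty trajectory per character in the scene."""
    return {c: Trajectory(scene.environment, c) for c in scene.characters}
```

Under these assumptions, a scene with two characters yields two independent trajectories, each of which grows by one pair per interaction step.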
Abstract
Role-playing is a crucial capability of Large Language Models (LLMs), enabling a wide range of practical applications, including intelligent non-player characters, digital twins, and emotional companions. Evaluating this capability in LLMs is challenging due to the complex dynamics involved in role-playing, such as maintaining character fidelity throughout a storyline and navigating open-ended narratives without a definitive ground truth. Current evaluation methods, which primarily focus on question-answering or conversational snapshots, fall short of adequately capturing the nuanced character traits and behaviors essential for authentic role-playing. In this paper, we propose CharacterBox, which is a simulation sandbox designed to generate situational fine-grained character behavior trajectories. These behavior trajectories enable a more comprehensive and in-depth evaluation of role-playing capabilities. CharacterBox consists of two main components: the character agent and the narrator agent. The character agent, grounded in psychological and behavioral science, exhibits human-like behaviors, while the narrator agent coordinates interactions between character agents and environmental changes. Additionally, we introduce two trajectory-based methods that leverage CharacterBox to enhance LLM performance. To reduce costs and facilitate the adoption of CharacterBox by public communities, we fine-tune two smaller models, CharacterNR and CharacterRM, as substitutes for GPT API calls, and demonstrate their competitive performance compared to advanced GPT APIs.
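The division of labor between the two agents can be pictured as a simple loop: each character agent acts on what it observes, and the narrator agent then updates the environment from those actions. The sketch below is only an illustration under stated assumptions. The function `run_scene`, the two callable signatures, and the simplification that every character observes the full environment are hypothetical and are not taken from the paper.

```python
from typing import Callable

# Hypothetical signatures; in practice each would wrap an LLM call.
CharacterFn = Callable[[str, str], str]             # (character, observation) -> action
NarratorFn = Callable[[str, dict[str, str]], str]   # (environment, actions) -> new environment


def run_scene(
    environment: str,
    characters: list[str],
    act: CharacterFn,
    narrate: NarratorFn,
    rounds: int,
) -> tuple[str, dict[str, list[tuple[str, str]]]]:
    """Run a fixed number of rounds and collect one behavior trajectory per character."""
    trajectories: dict[str, list[tuple[str, str]]] = {c: [] for c in characters}
    for _ in range(rounds):
        actions: dict[str, str] = {}
        for c in characters:
            # Simplifying assumption: every character observes the full environment.
            observation = environment
            action = act(c, observation)
            trajectories[c].append((observation, action))
            actions[c] = action
        # The narrator turns the characters' actions into an updated environment.
        environment = narrate(environment, actions)
    return environment, trajectories
```

With stub functions in place of the two agents, the loop returns the final environment and a list of (observation, action) pairs per character, which is the kind of behavior trajectory an evaluator would then score.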
