Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents
Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Yanfang Ye, Toby Jia-Jun Li, Dakuo Wang
TL;DR
The paper tackles the inconsistency in evaluating LLM-based Role-Playing Agents (RPAs) across diverse tasks and designs. It conducts a systematic literature review of 1,676 papers published between 2021 and 2024, identifying six agent attributes, seven task attributes, and seven evaluation metrics, then develops an evidence-based, two-step RPA evaluation design guideline that links metrics to attributes. Through case studies, it demonstrates how proper metric selection yields comprehensive, robust assessments and highlights common pitfalls of flawed evaluations. The work also discusses the relationships between agent attributes and downstream tasks, analyzes design considerations for RPA personas, and addresses the challenges of evaluating highly flexible, human-like agents, offering practical guidance for more reliable benchmarking and cross-task comparability.
Abstract
Role-Playing Agents (RPAs) are an increasingly popular type of LLM agent that simulates human-like behaviors across a variety of tasks. However, evaluating RPAs is challenging because task requirements and agent designs vary widely. This paper proposes an evidence-based, actionable, and generalizable evaluation design guideline for LLM-based RPAs, derived from a systematic review of 1,676 papers published between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes, seven task attributes, and seven evaluation metrics from the existing literature. Based on these findings, we present an RPA evaluation design guideline to help researchers develop more systematic and consistent evaluation methods.
