RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines

Pengfei Yu; Dongming Shen; Silin Meng; Jaewon Lee; Weisu Yin; Andrea Yaoyun Cui; Zhenlin Xu; Yi Zhu; Xingjian Shi; Mu Li; Alex Smola

TL;DR

RPGBench introduces a benchmark for evaluating large language models as text-based RPG engines. It features two core tasks, Game Creation (GC) and Game Simulation (GS), and a two-stage validity pipeline with a BFS checker that ensures generated game worlds are mechanically sound. It combines an event-state representation of game mechanics with a multi-round GS framework that generates narrative content, candidate actions, and dynamic state updates, and it assesses both objective mechanical correctness and subjective quality via LLM judges and human studies. The dataset comprises 100 NPC-based prompts, which yield 125 valid games for GS evaluation. The results show that LLMs generate engaging narratives yet struggle with long-horizon, fully verifiable mechanics. The work contributes a hybrid evaluation suite and a formal mechanical-check mechanism that together advance the development of controllable, immersive, text-based RPGs.

Abstract

We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must craft a valid and playable RPG world using a structured event-state representation, ensuring logical coherence and proper termination conditions. In GS, the LLM simulates interactive gameplay across multiple rounds while consistently updating states and enforcing game rules. To comprehensively assess performance, RPGBench integrates objective and subjective evaluation methodologies. Objective measures verify adherence to event mechanics and check variable updates without requiring human intervention. Subjective measures, such as content interestingness, action quality, and role-playing capability, are evaluated via an LLM-as-a-judge framework, where a strong LLM grades each candidate's outputs. Empirical results demonstrate that state-of-the-art LLMs can produce engaging stories but often struggle to implement consistent, verifiable game mechanics, particularly in long or complex scenarios. By combining structured, rule-based assessments with LLM-based judgments, RPGBench provides a new standard for evaluating how well LLMs can balance creativity, coherence, and complexity in text-based RPGs, opening avenues for more immersive and controllable interactive storytelling.
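To make the idea of a mechanical validity check concrete, the following is a minimal sketch of how a breadth-first search over an event-state game could confirm that at least one terminating event is reachable from the initial state. It is not the authors' implementation: the `Event` class, the `bfs_can_terminate` function, the `max_states` cap, and the toy key-and-gate game are all hypothetical names and values chosen for illustration, and the real benchmark's representation may differ.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vars = Dict[str, int]
Frozen = Tuple[Tuple[str, int], ...]  # hashable snapshot of the state variables


@dataclass(frozen=True)
class Event:
    """Hypothetical event: fires when `condition` holds, then applies `effect`."""
    name: str
    condition: Callable[[Vars], bool]
    effect: Callable[[Vars], Vars]
    terminal: bool = False  # True if firing this event ends the game


def freeze(state: Vars) -> Frozen:
    """Turn a variable dict into a hashable, order-independent key."""
    return tuple(sorted(state.items()))


def bfs_can_terminate(initial: Vars, events: List[Event], max_states: int = 10_000) -> bool:
    """Return True if some terminal event can fire in a state reachable from `initial`.

    The search gives up (returning False) once more than `max_states`
    distinct states have been discovered; this cap is an assumption.
    """
    start = freeze(initial)
    seen = {start}
    queue = deque([start])
    while queue and len(seen) <= max_states:
        current = dict(queue.popleft())
        for event in events:
            if not event.condition(current):
                continue  # event is not triggerable in this state
            if event.terminal:
                return True  # a termination condition is reachable
            nxt = freeze(event.effect(current))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Toy game: the player must find a key before opening the gate.
playable_events = [
    Event("find_key", lambda s: s["has_key"] == 0, lambda s: {**s, "has_key": 1}),
    Event("open_gate", lambda s: s["has_key"] == 1, lambda s: s, terminal=True),
]
# Broken variant: the gate demands a value that no event ever produces.
broken_events = [
    playable_events[0],
    Event("open_gate", lambda s: s["has_key"] == 2, lambda s: s, terminal=True),
]

playable = bfs_can_terminate({"has_key": 0}, playable_events)
broken = bfs_can_terminate({"has_key": 0}, broken_events)
```

In this sketch the first game passes the check because the ending is reachable, while the second fails because its termination condition can never be satisfied. A check of this kind needs no human or LLM judgment, which is the property the objective measures rely on.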
