Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games

Dekun Wu; Haochen Shi; Zhiyuan Sun; Bang Liu

TL;DR

This work tackles the challenge of deploying large language model (LLM)-based agents in Jubensha, a complex Chinese detective role-playing game. It introduces a Chinese Jubensha dataset and the ThinkThrice multi-agent framework, which enables autonomous agent interactions, memory-driven reasoning, and self-verification within a constrained game loop. The authors design two objective evaluations, Factual Question Answering and Inferential Question Answering, to quantify information mastery and reasoning. They show that modules for memory retrieval, self-refinement, and self-verification improve information gathering, murderer identification, and reasoning, particularly when using GPT-4. The study provides a new benchmark and methodology for evaluating LLM agents in narrative, adversarial settings, with implications for AI agents in social reasoning tasks and game AI research.

Abstract

In this study, we explore the application of Large Language Models (LLMs) in Jubensha, a Chinese detective role-playing game and a novel area in Artificial Intelligence (AI) driven gaming. We introduce the first dataset specifically for Jubensha, including character scripts and game rules, to foster AI agent development in this complex narrative environment. Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in this game. To evaluate the gaming performance of these AI agents, we developed novel methods measuring their mastery of case information and reasoning skills. Furthermore, we incorporated the latest advancements in in-context learning to improve the agents' performance in information gathering, murderer identification, and logical reasoning. The experimental results validate the effectiveness of our proposed methods. This work aims to offer a novel perspective on understanding LLM capabilities and establish a new benchmark for evaluating large language model-based agents.

Paper Structure (33 sections, 1 equation, 7 figures, 27 tables)

Introduction
Related Work
Interactive Role-Playing Games
LLM-based Autonomous Agents
Evaluating LLM-based Agents
Jubensha Dataset
Background of Jubensha Game
Dataset Construction
The ThinkThrice Framework for Jubensha Games
Memory Retrieval
Self-Refinement
Self-Verification
Evaluating LLM-based Agents in Jubensha Games
Factual Question Answering
Inferential Question Answering
...and 18 more sections

Figures (7)

Figure 1: Illustration of the Jubensha game. It requires players to interact with each other and reason about who is the murderer in a story.
Figure 2: Illustration of our proposed ThinkThrice framework for enhancing agents' performance in multi-agent detective games (i.e., Jubensha). The three different colors of the arrows indicate the data flows of three stages: 1) Initial answer generation with Memory Retrieval; 2) Enhance answer with Self-Refinement; 3) Verify answer with Self-Verification. The brown text in the refined answer is new information added to the initial answer.
Figure 3: Average win rate of civilian players and the average murderer identification accuracy across different architectures in Jubensha games.
Figure 4: GPT-3.5 and GPT-4's performance with different methods, where overall accuracy measures the raw percentage of correct answers and informed accuracy takes the LLM's reasoning ability into consideration. FSA stands for 'Full Script Access', indicating that agents have access to the complete scripts of all players.
Figure 5: Human Evaluation on the Quality of Agents' Responses.
...and 2 more figures
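
The three stages named in the Figure 2 caption (Memory Retrieval, Self-Refinement, Self-Verification) can be read as a small pipeline around a single question. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable, the prompt wording, the word-overlap retriever, and the fallback to the initial answer when verification fails are all hypothetical choices made for illustration.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def retrieve(memory: List[str], question: str, k: int = 3) -> List[str]:
    """Return the k memory entries sharing the most words with the question.

    Word overlap is a stand-in retriever chosen for this sketch; the paper's
    actual retrieval method is not described in this summary.
    """
    q_words = set(question.lower().split())
    ranked = sorted(
        memory,
        key=lambda entry: len(q_words & set(entry.lower().split())),
        reverse=True,  # sorted() is stable, so ties keep their original order
    )
    return ranked[:k]


def think_thrice(llm: LLM, memory: List[str], question: str) -> str:
    """Run the three stages in order and return the final answer."""
    context = "\n".join(retrieve(memory, question))

    # Stage 1: initial answer generation with retrieved memory.
    initial = llm(f"CONTEXT:\n{context}\nQUESTION: {question}\nANSWER:")

    # Stage 2: self-refinement, asking for details the draft left out.
    refined = llm(
        f"CONTEXT:\n{context}\nDRAFT: {initial}\n"
        "REFINE: add relevant details missing from the draft."
    )

    # Stage 3: self-verification. Assumption: keep the refined answer only
    # if the model judges it supported, otherwise fall back to the draft.
    verdict = llm(
        f"CONTEXT:\n{context}\nCANDIDATE: {refined}\n"
        "VERIFY: reply YES if the candidate is supported by the context, else NO."
    )
    return refined if verdict.strip().upper().startswith("YES") else initial
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, so a stub function is enough to exercise the control flow.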
