Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games

Dekun Wu, Haochen Shi, Zhiyuan Sun, Bang Liu

TL;DR

This work tackles the challenge of deploying large language model (LLM)–based agents in Jubensha, a complex Chinese detective role‑playing game. It introduces a Chinese Jubensha dataset and a ThinkThrice multi‑agent framework enabling autonomous agent interactions, memory‑driven reasoning, and self‑verification within a constrained game loop. The authors design objective evaluations—Factual and Inferential Question Answering—to quantify information mastery and reasoning, and show that modules for memory retrieval, self‑refinement, and self‑verification improve information gathering, murderer identification, and reasoning, particularly when using GPT‑4. The study provides a new benchmark and methodology for evaluating LLM agents in narrative, adversarial settings, with implications for AI agents in social reasoning tasks and game AI research.
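The three stages named above (memory retrieval, self-refinement, self-verification) can be pictured as a small answer loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `DetectiveAgent` class, the `LLM` callable type, the bracketed prompt tags, the word-overlap retrieval score, and the `max_checks` parameter are all hypothetical names introduced for illustration. The paper's actual prompts, retriever, and verification criteria are not given in this summary.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


def _tokens(text: str) -> set:
    """Lowercased word set, used for a toy overlap-based retrieval score."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class DetectiveAgent:
    name: str
    script: str
    llm: LLM
    memory: List[str] = field(default_factory=list)

    def retrieve(self, question: str, k: int = 3) -> List[str]:
        """Return the k memory entries sharing the most words with the question."""
        q = _tokens(question)
        ranked = sorted(self.memory, key=lambda m: len(q & _tokens(m)), reverse=True)
        return ranked[:k]

    def answer(self, question: str, max_checks: int = 2) -> str:
        context = "\n".join(self.retrieve(question))
        base = f"Script: {self.script}\nMemory: {context}\nQuestion: {question}"

        # Stage 1: initial answer generation with memory retrieval.
        draft = self.llm(f"[ANSWER]\n{base}")

        # Stage 2: self-refinement, asking the model to enrich its own draft.
        refined = self.llm(f"[REFINE]\n{base}\nDraft: {draft}")

        # Stage 3: self-verification; refine again if the check does not pass.
        for _ in range(max_checks):
            verdict = self.llm(f"[VERIFY]\n{base}\nAnswer: {refined}")
            if verdict.strip().upper().startswith("YES"):
                break
            refined = self.llm(f"[REFINE]\n{base}\nDraft: {refined}")

        # Store the exchange so later turns can retrieve it.
        self.memory.append(f"{self.name} was asked '{question}' and answered: {refined}")
        return refined
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, so the loop can be exercised with a stub function before a real model is wired in.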

Abstract

In this study, we explore the application of Large Language Models (LLMs) in Jubensha, a Chinese detective role-playing game and a novel area in Artificial Intelligence (AI)-driven gaming. We introduce the first dataset specifically for Jubensha, including character scripts and game rules, to foster AI agent development in this complex narrative environment. Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in this game. To evaluate the gaming performance of these AI agents, we developed novel methods measuring their mastery of case information and reasoning skills. Furthermore, we incorporated the latest advancements in in-context learning to improve the agents' performance in information gathering, murderer identification, and logical reasoning. The experimental results validate the effectiveness of our proposed methods. This work aims to offer a novel perspective on understanding LLM capabilities and establish a new benchmark for evaluating large language model-based agents.

Paper Structure

This paper contains 33 sections, 1 equation, 7 figures, 27 tables.

Figures (7)

  • Figure 1: Illustration of the Jubensha game. It requires players to interact with each other and reason about who is the murderer in a story.
  • Figure 2: Illustration of our proposed ThinkThrice framework for enhancing agents' performance in multi-agent detective games (i.e., Jubensha). The three arrow colors indicate the data flows of the three stages: 1) initial answer generation with Memory Retrieval; 2) answer enhancement with Self-Refinement; 3) answer verification with Self-Verification. The brown text in the refined answer is new information added to the initial answer.
  • Figure 3: Average win rate of civilian players and the average murderer identification accuracy across different architectures in Jubensha games.
  • Figure 4: Performance of GPT-3.5 and GPT-4 with different methods, where overall accuracy measures the raw percentage of correct answers and informed accuracy takes the LLM's reasoning ability into consideration. FSA stands for 'Full Script Access', indicating that agents have access to the complete scripts of all players.
  • Figure 5: Human Evaluation on the Quality of Agents' Responses.
  • ...and 2 more figures