How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment

Rebecca Ansell; Autumn Toney-Wails

Abstract

Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.

Paper Structure (24 sections, 7 figures, 1 table, 1 algorithm)

Introduction
Related Work
Agentic Reasoning
Agentic Game Play
The Clue Environment
Game Setup and Rules
Formal Task Definition
Experimental Design
Agent Architecture
Reasoning Phases
Fine-tuning Details
Experimental Setup
Evaluation Metrics
Results
Deduction Quality
...and 9 more sections

Figures (7)

Figure 1: Clue gameplay diagram illustrating Player 1's turn. In this example, Player 1 made a suggestion (Mrs. White, Kitchen, Rope). Player 2 had no cards matching the suggestion, so the suggestion passed to Player 3. Since Player 3 had a matching card, Player 3 revealed it to Player 1 only. Player 1 can then rule out "Rope" as the correct weapon, and Players 2, 4, 5, and 6 know that Player 3 must hold at least one of Mrs. White, Kitchen, or Rope.
Figure 2: Mind Bender fine-tuning example (adapted from the original).
Figure 3: Average correct and incorrect deductions per game by model in both the Baseline experiment and the Fine-tuned (FT) experiment. GPT-4o-mini (FT) makes the most deductions in both categories but achieves the worst accusation accuracy, while Gemini-2.5-Flash (FT) deduces least yet performs best.
Figure 4: Accusation accuracy per player per game. Each cell shows the number of correctly identified solution cards.
Figure 5: Average known cards over rounds in the fine-tuned experiment (12 games). Knowledge is computed as hand cards + shown cards + correct deductions, capped at 18. GPT-4o-mini (FT) accumulates information fastest but achieves the worst accusation accuracy, while Gemini-2.5-Flash (FT) gathers less information but reasons more effectively.
...and 2 more figures
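
The suggestion flow in the Figure 1 caption and the knowledge measure in the Figure 5 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: the function names `resolve_suggestion` and `known_cards`, the 0-indexed player numbering, and the rule that a responder shows the alphabetically first matching card are all assumptions made for the example (in a real game the responder chooses which matching card to show).

```python
from typing import Optional


def resolve_suggestion(
    suggester: int,
    suggestion: set[str],
    hands: list[set[str]],
) -> Optional[tuple[int, str]]:
    """Pass a suggestion around the table until some player can refute it.

    Returns (responder_index, shown_card), or None if nobody holds a match.
    Players are 0-indexed here, which is an assumption of this sketch.
    """
    n = len(hands)
    for offset in range(1, n):
        responder = (suggester + offset) % n
        matches = sorted(hands[responder] & suggestion)
        if matches:
            # Assumption: show the alphabetically first match so the
            # example is deterministic.
            return responder, matches[0]
    return None


def known_cards(hand: int, shown: int, correct_deductions: int, cap: int = 18) -> int:
    """Knowledge as in the Figure 5 caption: a sum of counts, capped at 18."""
    return min(hand + shown + correct_deductions, cap)


# Hypothetical hands mirroring Figure 1: index 0 is Player 1, index 2 is Player 3.
hands = [
    {"Colonel Mustard", "Library", "Knife"},
    {"Miss Scarlett", "Study", "Wrench"},
    {"Rope", "Hall", "Mr. Green"},
]
result = resolve_suggestion(0, {"Mrs. White", "Kitchen", "Rope"}, hands)
```

With these hypothetical hands, the second player has no matching card, so the suggestion passes to the third player, who shows "Rope" to the suggester. Other players would learn only that the third player holds at least one of the three suggested cards.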
