Enhance Reasoning for Large Language Models in the Game Werewolf

Shuang Wu; Liwen Zhu; Tao Yang; Shiwei Xu; Qiang Fu; Yang Wei; Haobo Fu

TL;DR

The paper addresses the gap in high-level reasoning for LLM agents by introducing an external Thinker. In this dual-system framework for Werewolf, the Thinker handles System-2 tasks (logical analysis and domain knowledge) while LLMs handle System-1 processing. The approach is instantiated as a three-component pipeline (Listener, Thinker, Presenter), and the Thinker is trained via imitation learning and PPO-based RL on FanLang-9 (18,800 sessions), the largest social deduction game dataset to date. Experiments show that Thinker-enhanced models outperform standard GPT-based baselines in deductive reasoning, speech generation, and online play, and that a 6B model fine-tuned on FanLang-9 surpasses GPT-4 in several scenarios. These results suggest that the external Thinker framework can align LLM agents more closely with human strategies and real-world data in complex, deception-rich domains, and that the dataset release may accelerate future research in social deduction AI.

Abstract

This paper presents an innovative framework that integrates Large Language Models (LLMs) with an external Thinker module to enhance the reasoning capabilities of LLM-based agents. Unlike augmenting LLMs with prompt engineering, Thinker directly harnesses knowledge from databases and employs various optimization techniques. The framework forms a reasoning hierarchy where LLMs handle intuitive System-1 tasks such as natural language processing, while the Thinker focuses on cognitive System-2 tasks that require complex logical analysis and domain-specific knowledge. Our framework is presented using a 9-player Werewolf game that demands dual-system reasoning. We introduce a communication protocol between LLMs and the Thinker, and train the Thinker using data from 18,800 human sessions and reinforcement learning. Experiments demonstrate the framework's effectiveness in deductive reasoning, speech generation, and online game evaluation. Additionally, we fine-tune a 6B LLM to surpass GPT-4 when integrated with the Thinker. This paper also contributes the largest dataset for social deduction games to date.

Paper Structure (42 sections, 4 equations, 9 figures, 17 tables, 1 algorithm)

Introduction
Related Work
Methods
Data preparation
Listener
Thinker
Presenter
Experiments
Deductive Reasoning
Thinker-induced Speech generation
Online Evaluation
Discussion and Future Work
Conclusion
Design Principle
Motivation
...and 27 more sections

Figures (9)

Figure 1: Overall processing framework and modules in the Werewolf implementation.
Figure 2: Voting and identification accuracy, which evaluate reasoning capability from the perspective of villagers. The random baseline is calculated as total_role_number/total_hidden_players, i.e., 3/8 or 1/8.
Figure 3: Human preference score for generated speeches grouped by identities.
Figure 4: An example comparison of speeches with and without strategic instruction.
Figure 5: Comparing our framework with related approaches.
...and 4 more figures
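
The random baseline quoted in the Figure 2 caption can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `random_baseline` and the constant `TOTAL_PLAYERS` are hypothetical, and it assumes that the 8 hidden players are the 9 players of the game minus the observing villager, and that the two baselines correspond to a role held by three players and a role held by one player.

```python
from fractions import Fraction

# Assumption: 9-player game (from the abstract); the observing villager
# knows only their own role, so the other 8 players are hidden.
TOTAL_PLAYERS = 9
HIDDEN_PLAYERS = TOTAL_PLAYERS - 1


def random_baseline(total_role_number: int, total_hidden_players: int) -> Fraction:
    """Chance that a uniformly random hidden player holds the given role."""
    if total_hidden_players <= 0:
        raise ValueError("total_hidden_players must be positive")
    if not 0 <= total_role_number <= total_hidden_players:
        raise ValueError("total_role_number must lie in [0, total_hidden_players]")
    return Fraction(total_role_number, total_hidden_players)


# Baseline for a role held by three hidden players (the caption's 3/8).
three_player_role = random_baseline(3, HIDDEN_PLAYERS)
# Baseline for a role held by a single hidden player (the caption's 1/8).
single_player_role = random_baseline(1, HIDDEN_PLAYERS)

print(three_player_role, float(three_player_role))    # 3/8 0.375
print(single_player_role, float(single_player_role))  # 1/8 0.125
```

A model whose accuracy does not clearly exceed these values is doing no better than guessing uniformly among the hidden players.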
