Enhance Reasoning for Large Language Models in the Game Werewolf
Shuang Wu, Liwen Zhu, Tao Yang, Shiwei Xu, Qiang Fu, Yang Wei, Haobo Fu
TL;DR
The paper addresses the weakness of LLM agents in high-level reasoning by adding an external Thinker module. In this dual-system framework, the Thinker handles System-2 tasks (complex logical analysis and domain-specific knowledge) while the LLMs handle System-1 tasks such as natural language processing. The approach is instantiated on the game Werewolf as a three-component pipeline (Listener, Thinker, Presenter), with the Thinker trained via imitation learning and PPO-based RL on FanLang-9 (18,800 sessions), which the authors describe as the largest social deduction game dataset to date. In the reported experiments, Thinker-enhanced models outperform standard GPT-based baselines in deductive reasoning, speech generation, and online play, and a 6B model fine-tuned on FanLang-9 surpasses GPT-4 in several scenarios when paired with the Thinker. These results suggest that an external Thinker can bring LLM agents closer to human strategies in complex, deception-rich domains, and the dataset release may accelerate future research on social deduction AI.
Abstract
This paper presents a framework that integrates Large Language Models (LLMs) with an external Thinker module to enhance the reasoning capabilities of LLM-based agents. Unlike approaches that augment LLMs through prompt engineering, the Thinker directly harnesses knowledge from databases and employs various optimization techniques. The framework forms a reasoning hierarchy in which LLMs handle intuitive System-1 tasks such as natural language processing, while the Thinker focuses on cognitive System-2 tasks that require complex logical analysis and domain-specific knowledge. We demonstrate the framework on a 9-player Werewolf game, which demands dual-system reasoning. We introduce a communication protocol between the LLMs and the Thinker, and train the Thinker using data from 18,800 human sessions together with reinforcement learning. Experiments demonstrate the framework's effectiveness in deductive reasoning, speech generation, and online game evaluation. Additionally, we fine-tune a 6B LLM that surpasses GPT-4 when integrated with the Thinker. This paper also contributes the largest dataset for social deduction games to date.
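The division of labor described above can be illustrated with a minimal Python sketch of a Listener, Thinker, Presenter data flow. Everything in it is an assumption made for illustration: the `Claim` structure, the function names, the keyword matching that stands in for the LLM-based Listener, and the simple suspicion tally that stands in for the Thinker. The paper's Thinker is a learned policy trained on human sessions and reinforcement learning, and its actual communication protocol is not specified in this abstract.

```python
from dataclasses import dataclass
from typing import List

NUM_PLAYERS = 9  # the abstract describes a 9-player game


@dataclass
class Claim:
    """Hypothetical structured message: `speaker` accuses or defends `target`."""
    speaker: int
    target: int
    accuses: bool


def listener(speaker: int, speech: str) -> List[Claim]:
    """System-1 stand-in: turn free text into structured claims.

    An LLM would do this in the paper's framework; keyword matching is
    used here only to keep the sketch self-contained.
    """
    text = speech.lower()
    claims = []
    for target in range(1, NUM_PLAYERS + 1):
        if f"player {target} is a werewolf" in text:
            claims.append(Claim(speaker, target, True))
        elif f"player {target} is good" in text:
            claims.append(Claim(speaker, target, False))
    return claims


def thinker(claims: List[Claim], me: int) -> int:
    """System-2 stand-in: choose the most suspicious player.

    A plain tally replaces the learned policy: +1 per accusation,
    -1 per defense, ties broken in favor of the lowest player id.
    """
    suspicion = {p: 0 for p in range(1, NUM_PLAYERS + 1) if p != me}
    for claim in claims:
        if claim.target in suspicion:
            suspicion[claim.target] += 1 if claim.accuses else -1
    return max(suspicion, key=lambda p: (suspicion[p], -p))


def presenter(target: int) -> str:
    """System-1 stand-in: render the Thinker's decision as a speech."""
    return f"I suspect player {target} is a werewolf."


def play_turn(me: int, speeches: List[tuple]) -> str:
    """Run one Listener -> Thinker -> Presenter pass for player `me`."""
    claims = []
    for speaker, speech in speeches:
        claims.extend(listener(speaker, speech))
    return presenter(thinker(claims, me))
```

For example, `play_turn(1, [(2, "Player 5 is a werewolf."), (3, "I agree, player 5 is a werewolf, and player 2 is good.")])` routes two speeches through the three stages. The point of the design is that only structured claims and a structured decision cross the boundary between the language components and the reasoning component, so the reasoning component can be trained or replaced independently of the LLM.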
