Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies

Zirui Song, Yuan Huang, Junchang Liu, Haozhe Luo, Chenxi Wang, Lang Gao, Zixiang Xu, Mingfei Han, Xiaojun Chang, Xiuying Chen

TL;DR

The paper addresses the gap between self-play LLM evaluations in social deduction games and authentic human gameplay by introducing WereBench, a high-quality multimodal dataset derived from televised Werewolf matches, and WereAlign, a strategy-alignment evaluation framework. The two-stage evaluation probes both speech quality (through ground-truth, MVP-aligned multiple-choice questions across five social dimensions) and decisions (via vote alignment and opponent-role inference against human strategies). Empirical results show substantial variation across state-of-the-art models: about half score below 0.50 in speech evaluation, indicating significant gaps in deception and counterfactual reasoning. Larger models generally outperform smaller ones but still struggle with strategic nuance. The work contributes a dataset, an evaluation paradigm, and empirical evidence that current LLMs are linguistically fluent but not yet strategically competent in complex multi-agent social interactions, pointing to language, reasoning, and social intelligence as directions for improvement. The framework and dataset are intended to support future research on language-driven strategic behavior in AI and to serve as a benchmark for social reasoning in multi-agent systems.

Abstract

Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models' linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments reveal diverse performance among state-of-the-art LLMs, with roughly half remaining below 0.50, exposing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.
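
The two-stage evaluation described in the abstract can be illustrated with a minimal sketch. The paper's exact metric definitions are not given on this page, so the code below assumes plain accuracy for both stages: the fraction of multiple-choice items where the model picks the option aligned with the winning faction, and the fraction of voting rounds where the model's vote matches the human reference. The names `SpeechItem`, `speech_accuracy`, and `vote_alignment`, and all field names, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpeechItem:
    """One multiple-choice item (hypothetical schema, not the dataset's format)."""
    dimension: str     # social-ability dimension the item belongs to
    gold_option: str   # option aligned with the winning faction's strategy
    model_option: str  # option selected by the model under evaluation


def speech_accuracy(items):
    """Return per-dimension accuracy: matches with the gold option / items seen."""
    totals, hits = {}, {}
    for item in items:
        totals[item.dimension] = totals.get(item.dimension, 0) + 1
        hits[item.dimension] = hits.get(item.dimension, 0) + int(
            item.model_option == item.gold_option
        )
    return {dim: hits[dim] / totals[dim] for dim in totals}


def vote_alignment(model_votes, human_votes):
    """Return the fraction of rounds where the model's vote equals the human vote."""
    if len(model_votes) != len(human_votes):
        raise ValueError("vote lists must cover the same rounds")
    if not human_votes:
        return 0.0
    matches = sum(m == h for m, h in zip(model_votes, human_votes))
    return matches / len(human_votes)
```

Under this assumption, a score below 0.50 on the speech stage would mean the model chose the human-aligned option on fewer than half of the items.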

Paper Structure

This paper contains 28 sections, 1 equation, 32 figures, 4 tables.

Figures (32)

  • Figure 1: Limitations of prior LLM-based play in the Werewolf game. (a) Generated speeches are often shallow and lack informative content. (b) Even state-of-the-art models fail to fully capture the game rules, with anecdotal failure cases. (c) In contrast, our WereBench dataset, combined with the WereAlign evaluation framework, enables assessment of models with human-aligned strategies, capturing both speech quality and decision-making accuracy.
  • Figure 2: Role composition in WereBench.
  • Figure 3: An overview of our WereBench dataset. Each data sample provides the view of a complete game video, with the human annotation including: (a) role introduction; (b) role allocation; (c) rules; (d) speech with timestamps; (e) logs such as votes and skill usage; (f) highlight annotations; and (g) a summary with the expert's post-game analysis.
  • Figure 4: Example item from the WereAlign speech evaluation in Strategic Judgment. The context consists of the speech history and public game logs, followed by a question, candidate options, and explanations. $[\mathcal{M}_i]$ represent the generation mechanisms.
  • Figure 5: Role‑conditioned performance on WereBench: in the decision task, LLMs are strongest as Witch, whereas in the speech task they perform best as cue‑rich roles such as Werewolf and Seer.
  • ...and 27 more figures