Werewolf Arena: A Case Study in LLM Evaluation via Social Deduction

Suma Bailis; Jane Friedhoff; Feiyang Chen

TL;DR

Werewolf Arena introduces a dynamic, competitive benchmark for evaluating LLMs through the social deduction game Werewolf, using a bidding-based turn-taking mechanism to probe strategic communication. The framework pairs memory-augmented agents with a rules-based Game Master (GM) that orchestrates the Night/Day cycles, and the paper compares Gemini and GPT-family models in intra-family and head-to-head tournaments. Key findings show that Gemini 1.5 Pro often edges out GPT-4 in villager-like roles and that speaking strategy and verbosity significantly affect perceived deception and success; Seer behavior highlights the tension between disclosing information and personal risk. The work provides an open-source, scalable platform for studying multi-agent social reasoning in LLMs and sets the stage for richer, real-time evaluations of strategic communication under deception.

Abstract

This paper introduces Werewolf Arena, a novel framework for evaluating large language models (LLMs) through the lens of the classic social deduction game, Werewolf. In Werewolf Arena, LLMs compete against each other, navigating the game's complex dynamics of deception, deduction, and persuasion. The framework introduces a dynamic turn-taking system based on bidding, mirroring real-world discussions where individuals strategically choose when to speak. We demonstrate the framework's utility through an arena-style tournament featuring Gemini and GPT models. Our results reveal distinct strengths and weaknesses in the models' strategic reasoning and communication. These findings highlight Werewolf Arena's potential as a challenging and scalable LLM benchmark.
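The bidding-based turn-taking described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `choose_next_speaker`, the integer bid scale, and the tie-break rule (prefer players mentioned in the previous turn, then pick at random) are all assumptions made for illustration.

```python
import random
from typing import Dict, Optional, Set


def choose_next_speaker(
    bids: Dict[str, int],
    mentioned: Optional[Set[str]] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the player who speaks next, given each player's bid.

    Hypothetical helper: the highest bid wins the turn. The tie-break
    (prefer players named in the previous turn, then choose randomly)
    is an assumption, not a rule taken from the paper.
    """
    if not bids:
        raise ValueError("at least one bid is required")
    rng = rng or random.Random()

    top = max(bids.values())
    # Players tied at the highest bid, sorted for reproducible ordering.
    candidates = sorted(name for name, bid in bids.items() if bid == top)

    # Assumed tie-break: a player who was just mentioned gets priority.
    if mentioned:
        preferred = [name for name in candidates if name in mentioned]
        if preferred:
            candidates = preferred

    return rng.choice(candidates)
```

For example, with bids `{"Jackson": 4, "Ginger": 4, "Sam": 1}` and `mentioned={"Ginger"}`, the sketch selects `"Ginger"`: both Jackson and Ginger bid highest, and only Ginger was mentioned in the previous turn. The player name "Sam" is made up for this example.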

Paper Structure

This paper contains 24 sections, 2 equations, 10 figures, 1 table, and 1 algorithm.

Introduction
Related Work
Werewolf Environment
Game Implementation
Agent Architecture
Dynamic Turn-Taking through Bidding
Models
Debate Dynamics
Arena Evaluation
Win Rate Analysis
Gemini 1.5 Pro vs GPT-4
Qualitative Observations
Skill and Creativity
Communication Style
GPT-4 Manipulation Tactics
...and 9 more sections

Figures (10)

Figure 1: Game loop of Werewolf.
Figure 2: After the Seer reveals one Werewolf's identity, both Werewolves jump to defend their team, whereas the rest of the village does not feel any urgency to contribute. In their private reasoning, we see that Jackson wishes to defend Ginger and Ginger wishes to defend herself.
Figure 3: (a) Overall distribution of bids at each turn of the debate. (b) Distribution of bids at each turn of debate for only the players that were mentioned in the previous turn.
Figure 4: The evolution of votes during a debate. (Left: excerpt of debate transcript. Right: corresponding shifts in synthetic votes.) The width of each bar indicates how many votes the player received. The letters above the bars denote the roles of the voters.
Figure 5: Villager win ratios from our intra-family round-robin tournaments, as well as the final head-to-head matchup between GPT-4 and Gemini 1.5 Pro.
...and 5 more figures
