Helmsman of the Masses? Evaluate the Opinion Leadership of Large Language Models in the Werewolf Game

Silin Du, Xiaowei Zhang

TL;DR

This work investigates the opinion leadership capacity of large language models by embedding them in a Werewolf game with an elected Sheriff role, framing leadership with two metrics (Ratio and DC) and validating them through simulations, WWQA-based rule understanding, and human evaluation. The authors demonstrate that Werewolf is a credible test bed for leadership in multi-agent human-AI interactions, showing that only a few large LLMs exhibit meaningful influence over others, and that improved rule comprehension via WWQA does not guarantee stronger leadership. They also reveal that human participants can be more easily influenced by credible AI counterparts when trust is present, highlighting the practical challenges of aligning AI influence with desirable outcomes. Overall, the study advances methodology for evaluating AI opinion leadership and provides insights into how model scale, role assignment, and reliability reasoning affect social influence in AI-driven groups.

Abstract

Large language models (LLMs) have exhibited memorable strategic behaviors in social deduction games. However, the opinion leadership exhibited by LLM-based agents, which is crucial for practical applications in multi-agent and human-AI interaction settings, has been largely overlooked. Opinion leaders are individuals who have a noticeable impact on the beliefs and behaviors of others within a social group. In this work, we employ the Werewolf game as a simulation platform to assess the opinion leadership of LLMs. The game includes the role of the Sheriff, who is tasked with summarizing arguments and recommending decision options and therefore serves as a credible proxy for an opinion leader. We develop a framework integrating the Sheriff role and devise two novel metrics based on the critical characteristics of opinion leaders. The first metric measures the reliability of the opinion leader, and the second assesses the influence of the opinion leader on other players' decisions. We conduct extensive experiments to evaluate LLMs of different scales. In addition, we collect a Werewolf question-answering dataset (WWQA) to assess and enhance LLMs' grasp of the game rules, and we also incorporate human participants for further analysis. The results suggest that the Werewolf game is a suitable test bed for evaluating the opinion leadership of LLMs, and that few LLMs possess the capacity for opinion leadership.

Paper Structure (43 sections, 22 equations, 6 figures, 23 tables)

Figures (6)

  • Figure 1: The game framework to evaluate the opinion leadership of LLMs. The blue font shows some simplified notations, with the full list available in the notation table of Appendix B. Each player needs to reason about the roles and reliability of other players before taking any action. We design two metrics to measure the opinion leadership of the LLM acting as the Sheriff. Ratio measures the credibility of the Sheriff, while DC assesses the Sheriff’s influence on the voting decisions of other players. More details are presented in Section 3.2.
  • Figure 2: Overview of the data generation process
  • Figure 3: Opinion leadership of LLMs under different roles
  • Figure 4: The whole process during round $t$
  • Figure 5: Screenshot of the interface during human evaluation
  • ...and 1 more figure