Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game

Byungjun Kim, Dayeon Seo, Minju Kim, Bugeun Kim

TL;DR

The paper targets the evaluation of LLMs in obscured, adversarial communication settings using social deduction games. It introduces a fine-grained six-metric framework that separately assesses subtext inference (via $GSR$, $ICR$, $IDR$) and deceptive control (via $CR$, $VR$, $VE$), and couples these with a thematic qualitative analysis to identify four reasoning failure categories (Exposure, Memory Distortion, Dissociation, and Character Ambiguity with subtypes). Through SpyGame experiments with four LLMs against strong or weak citizens, the authors show that coarse metrics like win rate miss nuanced capabilities, while the fine-grained metrics reveal distinct model profiles and failure modes, with GPT-4 often leading in inference and deception tasks. The study demonstrates a robust, reproducible evaluation framework that links qualitative reasoning errors to quantitative outcomes, offering insights for designing more resilient LLM agents in complex, adversarial dialogues and suggesting broader applicability beyond game settings. Future work aims to generalize the framework to broader domains and multi-agent coordination, enhancing the utility of systematic, event-grounded evaluation for LLMs.

Abstract

Recent studies have investigated whether large language models (LLMs) can support obscured communication, which is characterized by core aspects such as inferring subtext and evading suspicion. To conduct the investigation, researchers have used social deduction games (SDGs) as their experimental environment, in which players conceal and infer specific information. However, prior work has often overlooked how LLMs should be evaluated in such settings. Specifically, we point out two limitations of the evaluation methods they employed. First, the metrics used in prior studies are coarse-grained because they are based on overall game outcomes, which often fail to capture event-level behaviors. Second, error analyses have lacked structured methodologies capable of producing insights that meaningfully support evaluation outcomes. To address these limitations, we propose a microscopic and systematic approach to the investigation. Specifically, we introduce six fine-grained metrics that resolve the first issue. To tackle the second issue, we conducted a thematic analysis and identified four major reasoning failures that undermine LLMs' performance in obscured communication.

Paper Structure (31 sections, 1 figure, 9 tables)