Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
Davi Bastos Costa, Renato Vicente
TL;DR
This work introduces Mini-Mafia, a controlled benchmark with three roles (mafioso, detective, villager) for studying social intelligence in large language models within a simple three-agent interaction. The authors model the mafia win-rate $p$ with three latent capabilities ($m$: deception, $d$: disclosure, $v$: detection) through the compact logistic model $\text{logit}(p)=v\times(m-d)$, and estimate these parameters from 14,000 LLM-driven games using Bayesian inference, validating the theory against a background-based analysis. Key findings include counterintuitive cases where smaller models outperform larger ones in specific dimensions, human baselines showing that LLMs excel at persuasion but lag in simple strategic reasoning, and emergent phenomena such as name bias and last-speaker advantage. The framework offers a principled, scalable way to quantify multi-agent interactions among LLMs and produces data suitable for training deception detectors, with implications for AI safety and evaluation.
Abstract
Mafia is a social deduction game where informed mafia compete against uninformed townsfolk. Its asymmetry of information and reliance on theory-of-mind reasoning mirror real-world multi-agent scenarios, making it a useful testbed for evaluating the social intelligence of large language models (LLMs). To support a systematic study, we introduce Mini-Mafia: a simplified four-player variant with one mafioso, one detective, and two villagers. We set the mafioso to kill a villager and the detective to investigate the mafioso during the night, reducing the game to a single day phase of discussion and voting among the three remaining players. Remarkably, we find that the mafia win-rate $p$ in this three-agent system can be described by a simple theoretical model: $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ are intrinsic model parameters representing the mafioso's deception, the detective's disclosure, and the villager's detection capabilities, respectively. This compact analytic description of an interacting triad shows that multi-agent dialogue can be captured by a few latent parameters while still matching empirical outcomes, opening a path to a principled theoretical description of multi-agent LLM systems. Estimating these parameters from LLM gameplay data using Bayesian inference yields the Mini-Mafia Benchmark. Our experiments reveal counterintuitive results, including cases where smaller models significantly outperform larger ones. We also establish human baselines, revealing that LLMs excel at persuasion but lag in simple strategic reasoning for agentic interaction. Beyond benchmarking, Mini-Mafia enables quantitative study of emergent multi-agent dynamics such as name bias and last-speaker advantage, and contributes to AI safety by generating training data for deception detectors.
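To make the model $\text{logit}(p) = v \times (m - d)$ concrete, the following minimal sketch inverts the logit to recover the mafia win-rate from the three parameters. The function name `mafia_win_probability` and the numeric parameter values are hypothetical, chosen only for illustration; they are not estimates from the paper, and the sketch does not reproduce the authors' Bayesian inference procedure.

```python
import math


def mafia_win_probability(m: float, d: float, v: float) -> float:
    """Invert logit(p) = v * (m - d) to obtain the mafia win-rate p.

    m: mafioso's deception, d: detective's disclosure,
    v: villager's detection (symbols as defined in the abstract).
    """
    return 1.0 / (1.0 + math.exp(-v * (m - d)))


# Hypothetical parameter values, for illustration only.
p_even = mafia_win_probability(m=1.0, d=1.0, v=2.0)       # m equals d
p_advantage = mafia_win_probability(m=1.5, d=0.5, v=2.0)  # m exceeds d by 1

print(f"m == d: p = {p_even:.3f}")
print(f"m - d = 1, v = 2: p = {p_advantage:.3f}")
```

Under this formula, whenever $m = d$ the win-rate is exactly $0.5$ regardless of $v$, and $v$ acts as a multiplier on how strongly the gap $m - d$ shifts the outcome away from an even game.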
