Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha; Adrià Garriga-Alonso

TL;DR

The paper addresses the challenge of studying open-ended agentic deception in language agents by introducing Among Us, a sandbox where deception can emerge under long-horizon goals. It defines Deception Elo as an unbounded deception metric and pairs it with activation-monitoring probes and Sparse Autoencoders to detect deception and explore steering limitations. Empirical results show frontier reasoning models excel at deception (and are not necessarily superior at detection), while linear probes achieve high AUROC for deception-related signals even in out-of-distribution settings. The work provides open-source sandbox resources, logs, and probes to facilitate ongoing safety research and the development of robust AI control techniques against deceptive behaviors.

Abstract

Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To address this, we introduce $\textit{Among Us}$, a sandbox social deception game in which LLM agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, $\textit{Among Us}$ can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate $18$ proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than at detecting it. We evaluate the effectiveness of two methods for detecting lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of ``pretend you're a dishonest model: $\dots$'' generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.
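For readers unfamiliar with the AUROC figure quoted above, the following is a minimal sketch of how the metric is computed from a probe's scores. It is not the authors' evaluation code: the function name `auroc`, the toy scores, and the labels are all hypothetical and chosen only for illustration. AUROC is the probability that a randomly chosen deceptive example receives a higher score than a randomly chosen honest one, with ties counted as one half.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    scores: probe outputs, higher means "more likely deceptive" (assumption).
    labels: 1 for deceptive, 0 for honest.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical probe scores; these are not results from the paper.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

In this toy case, eight of the nine deceptive/honest pairs are ordered correctly, so the value is 8/9. An AUROC above 0.95 therefore means that more than 95% of such pairs are ranked correctly by the probe.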
