Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso

TL;DR

The paper addresses the challenge of studying open-ended agentic deception in language agents by introducing Among Us, a sandbox where deception can emerge under long-horizon goals. It defines Deception Elo as an unbounded deception metric and pairs it with activation-monitoring probes and Sparse Autoencoders to detect deception and explore steering limitations. Empirical results show frontier reasoning models excel at deception (and are not necessarily superior at detection), while linear probes achieve high AUROC for deception-related signals even in out-of-distribution settings. The work provides open-source sandbox resources, logs, and probes to facilitate ongoing safety research and the development of robust AI control techniques against deceptive behaviors.
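To make the idea of an Elo-style deception rating concrete, the following is a minimal sketch of a standard Elo update applied to an impostor. It is not the paper's exact definition of Deception Elo: the function names, the K-factor of 32, the 400-point scale, and the choice to use the mean crewmate rating as the opponent strength are all assumptions made for illustration. The rating has no fixed ceiling, which is the sense in which an Elo-type metric is unbounded.

```python
from typing import Sequence


def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expected score of `rating` against `opponent_rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_impostor_elo(
    impostor_rating: float,
    crewmate_ratings: Sequence[float],
    impostor_won: bool,
    k: float = 32.0,  # assumed K-factor, not taken from the paper
) -> float:
    """Return the impostor's rating after one game.

    Assumption: the opposing strength is the mean rating of the crewmates.
    """
    opponent = sum(crewmate_ratings) / len(crewmate_ratings)
    expected = expected_score(impostor_rating, opponent)
    actual = 1.0 if impostor_won else 0.0
    return impostor_rating + k * (actual - expected)


# An impostor that beats equally rated crewmates gains k / 2 points.
new_rating = update_impostor_elo(1500.0, [1500.0, 1500.0], impostor_won=True)
```

Confidence intervals like those reported for the ratings could then be estimated by resampling games with replacement and recomputing the ratings on each resample.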

Abstract

Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce $\textit{Among Us}$, a sandbox social deception game where LLM-agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, $\textit{Among Us}$ can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate $18$ proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of ``pretend you're a dishonest model: $\dots$'' generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.
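The probing approach described above (logistic regression on activations, scored by AUROC) can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the "activations" are synthetic Gaussian vectors, and the helper names (`make_toy_activations`, `train_probe`, `probe_score`, `auroc`), the learning rate, and the dimensions are all hypothetical. Only the Python standard library is used so the sketch stays self-contained.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_activations(n, dim, rng):
    """Synthetic stand-in for activations: label 1 (deceptive) is shifted
    along dimension 0. Purely illustrative; not real model data."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if label == 1 else -2.0
        X.append(x)
        y.append(label)
    return X, y


def probe_score(w, b, x):
    """Linear probe logit for one activation vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(probe_score(w, b, x)) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
X_train, y_train = make_toy_activations(200, 4, rng)
X_test, y_test = make_toy_activations(200, 4, rng)
w, b = train_probe(X_train, y_train)
test_auroc = auroc([probe_score(w, b, x) for x in X_test], y_test)
```

AUROC is threshold-free, so it measures how well the probe ranks deceptive above honest examples without committing to a decision boundary. In an out-of-distribution evaluation, the held-out set would come from a different source than the training prompts rather than from the same generator as here.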

Paper Structure

This paper contains 36 sections, 3 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Examples of long-term, open-ended deception in 'Llama-3.3-70b-instruct' impostors.
  • Figure 2: Deception Elo ratings and win rates for each model with $1000$ bootstrap samples from $2054$ games with $90\%$ CI. Note the high win rates and high deception capability in frontier reasoning/"thinking" models (compare with Figure 5).
  • Figure 3: Win rates for 'Llama-3.3-70b-instruct' and 'Microsoft-Phi-4-15b', randomly assigned as impostors and crewmates, on $400$ 1v1 games. Note that 'Phi 4' performs better as a crewmate and 'Llama 3.3', a significantly larger model, performs better as an impostor, which fits the trend of larger models being relatively more deceptive.
  • Figure 4: Violin plots of LLM-based evaluation scores of agents' outputs for awareness, lying, deception, and planning. Crewmates almost never lie, and in some cases impostors are truthful in order to gain trust (see Chi et al., 2024).
  • Figure 5: Deception Elo vs. Detection Elo (Crewmate) for various models on $2054$ games as in Figure 2. Triangle models are RL-trained on tasks, whereas circle models see no RL except perhaps RLHF. The dashed line passes through the means with a slope of $1$, and we find most reasoning models to be above the line. CIs are similar to those in Figure 2 and are omitted for clarity.
  • ...and 11 more figures