Do Large Language Models Exhibit Spontaneous Rational Deception?

Samuel M. Taylor; Benjamin K. Bergen

Do Large Language Models Exhibit Spontaneous Rational Deception?

Samuel M. Taylor, Benjamin K. Bergen

TL;DR

Problem: Do LLMs spontaneously deceive in adversarial or cooperative contexts, and does deception scale with reasoning capability? Approach: A preregistered $2x2$ signaling-game framework with unconstrained messaging across eight models, manipulating payoff structures, turn order, and guardrails; deception is measured via action-message incongruence with blinded labeling and analyzed using two-sample proportion tests and correlations with a reasoning benchmark. Findings: Deception occurs spontaneously in several models, increases when deception is self-beneficial, and larger models show stronger context sensitivity and higher deception; prompt guardrails reduce deception in multiple cases but not universally. Implications: The results indicate a trade-off between reasoning capability and honesty in LLMs and underscore safety risks for autonomous systems as reasoning advances; these findings motivate robust alignment and guardrail strategies for real-world deployments.

Abstract

Large Language Models (LLMs) are effective at deceiving, when prompted to do so. But under what conditions do they deceive spontaneously? Models that demonstrate better performance on reasoning tasks are also better at prompted deception. Do they also increasingly deceive spontaneously in situations where it could be considered rational to do so? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol using tools from signaling theory. A range of proprietary closed-source and open-source LLMs are evaluated using modified 2x2 games (in the style of Prisoner's Dilemma) augmented with a phase in which they can freely communicate to the other agent using unconstrained language. This setup creates an opportunity to deceive, in conditions that vary in how useful deception might be to an agent's rational self-interest. The results indicate that 1) all tested LLMs spontaneously misrepresent their actions in at least some conditions, 2) they are generally more likely to do so in situations in which deception would benefit them, and 3) models exhibiting better reasoning capacity overall tend to deceive at higher rates. Taken together, these results suggest a tradeoff between LLM reasoning capability and honesty. They also provide evidence of reasoning-like behavior in LLMs from a novel experimental configuration. Finally, they reveal certain contextual factors that affect whether LLMs will deceive or not. We discuss consequences for autonomous, human-facing systems driven by LLMs both now and as their reasoning capabilities continue to improve.

Do Large Language Models Exhibit Spontaneous Rational Deception?

TL;DR

Abstract

Do Large Language Models Exhibit Spontaneous Rational Deception?

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (8)