Do Large Language Models Exhibit Spontaneous Rational Deception?

Samuel M. Taylor, Benjamin K. Bergen

TL;DR

Problem: Do LLMs spontaneously deceive in adversarial or cooperative contexts, and does deception scale with reasoning capability? Approach: A preregistered 2x2 signaling-game framework with unconstrained messaging across eight models, manipulating payoff structures, turn order, and guardrails; deception is measured as incongruence between a model's action and its message, labeled blind, and analyzed using two-sample proportion tests and correlations with a reasoning benchmark. Findings: Every tested model deceives spontaneously in at least some conditions, deception increases when it is self-beneficial, and models with stronger reasoning performance show stronger context sensitivity and higher deception rates; prompt guardrails reduce deception in multiple cases but not universally. Implications: The results suggest a trade-off between reasoning capability and honesty in LLMs and underscore safety risks for autonomous systems as reasoning advances; these findings motivate robust alignment and guardrail strategies for real-world deployments.
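As a rough illustration of the measurement summarized above, the sketch below computes a deception rate as the share of trials in which the stated action differs from the chosen action, then compares two conditions with a pooled two-sample proportion z-test. The function names, the trial representation, and the example counts are assumptions made for illustration. They are not the authors' code or data, and the paper's labels come from blinded raters rather than a string comparison.

```python
import math


def deception_rate(trials):
    """Fraction of trials whose stated action differs from the chosen action.

    `trials` is a list of (chosen_action, stated_action) pairs. Assumption:
    the stated action has already been extracted from the free-text message.
    """
    incongruent = sum(1 for chosen, stated in trials if chosen != stated)
    return incongruent / len(trials)


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for a difference in proportions.

    x1, x2 are deceptive-trial counts; n1, n2 are total trials per condition.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both conditions are all-deceptive or all-honest: no difference to test.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    z, p = two_proportion_z_test(30, 100, 10, 100)
    print(f"z = {z:.3f}, p = {p:.5f}")
```

The pooled standard error is the textbook choice when the null hypothesis is that both conditions share one underlying rate; whether the paper used the pooled or unpooled variant is not stated in this section.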

Abstract

Large Language Models (LLMs) are effective at deceiving when prompted to do so. But under what conditions do they deceive spontaneously? Models that demonstrate better performance on reasoning tasks are also better at prompted deception. Do they also increasingly deceive spontaneously in situations where it could be considered rational to do so? This study evaluates spontaneous deception produced by LLMs in a preregistered experimental protocol using tools from signaling theory. A range of proprietary closed-source and open-source LLMs are evaluated using modified 2x2 games (in the style of Prisoner's Dilemma) augmented with a phase in which they can freely communicate with the other agent using unconstrained language. This setup creates an opportunity to deceive, in conditions that vary in how useful deception might be to an agent's rational self-interest. The results indicate that 1) all tested LLMs spontaneously misrepresent their actions in at least some conditions, 2) they are generally more likely to do so in situations in which deception would benefit them, and 3) models exhibiting better reasoning capacity overall tend to deceive at higher rates. Taken together, these results suggest a tradeoff between LLM reasoning capability and honesty. They also provide evidence of reasoning-like behavior in LLMs from a novel experimental configuration. Finally, they reveal certain contextual factors that affect whether LLMs will deceive or not. We discuss consequences for autonomous, human-facing systems driven by LLMs both now and as their reasoning capabilities continue to improve.

Paper Structure

This paper contains 24 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: A sample of 2x2 games and example vignettes to better illustrate reward structures and incentives (larger numbers mean more reward). We transform these games into signaling games by introducing the possibility of communicating to the other player to try to affect their behavior, for instance, to persuade them to choose a specific option to one’s own benefit. [Note that the vignette text adjacent to each matrix is purely illustrative, and is not provided to the LLMs.]
  • Figure 2: The series of prompts and responses in a Matching Pennies game played with GPT-4-turbo. Prompts are on the right and LLM responses are on the left. More details on the prompting approach are described in the appendix.
  • Figure 3: Comparing the possible manipulations when modifying turn-order. Panel A depicts the default turn order, where Player 2 (the LLM) chooses first, then Player 2 sends a message, and finally Player 1 makes a choice. Here deception by Player 2 could have a causal impact on Player 1’s decision-making. Panel B depicts the manipulation where Player 1 makes a choice before Player 2 makes a choice and sends a message. Here, deception by Player 2 will have no effect.
  • Figure 4: Rates of deception (based on majority judgment from human and LLM raters, as described in the main text) across different models and reward matrices. Deception in Matching Pennies (blue), a competitive 2x2 game, could potentially serve rational self-interest, whereas deception in the Nihilism condition (red), where all values in the reward matrix are 0, and deception in the cooperative Stag Hunt (green) condition could not.
  • Figure 5: Deception rates across models for different turn orders, which measure the effect of the causal viability of deception. Deception in the condition where the LLM sends a message after its opponent has made a selection (red) does not serve the model’s rational self-interest because it cannot affect the opponent's choice. By contrast, deception in the baseline condition, where the LLM communicates before the other agent makes a selection (blue), can serve the model’s rational self-interest. Results show that deception increases when the LLM message can have an effect because it comes before the other player acts.
  • ...and 3 more figures
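
The incentive logic in the captions above (reward matrix and turn order) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the payoff values for Matching Pennies and Stag Hunt are illustrative textbook-style numbers rather than the paper's matrices, `deception_could_help` is a hypothetical helper, and the opponent is assumed to best-respond to whatever the LLM (Player 2) claims. Only the all-zero Nihilism matrix is taken directly from the Figure 4 caption.

```python
from typing import Dict, Tuple

# Maps (player1_action, player2_action) -> (player1_reward, player2_reward).
Payoffs = Dict[Tuple[str, str], Tuple[int, int]]

# Illustrative values only; not the matrices used in the paper.
MATCHING_PENNIES: Payoffs = {
    ("heads", "heads"): (1, -1), ("heads", "tails"): (-1, 1),
    ("tails", "heads"): (-1, 1), ("tails", "tails"): (1, -1),
}
STAG_HUNT: Payoffs = {
    ("stag", "stag"): (4, 4), ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0), ("hare", "hare"): (3, 3),
}
# All rewards are zero, as the Figure 4 caption describes.
NIHILISM: Payoffs = {key: (0, 0) for key in MATCHING_PENNIES}


def best_response_p1(payoffs: Payoffs, believed_p2_action: str) -> str:
    """Player 1's reward-maximizing action given what it believes Player 2 chose."""
    p1_actions = sorted({a1 for a1, _ in payoffs})
    return max(p1_actions, key=lambda a1: payoffs[(a1, believed_p2_action)][0])


def deception_could_help(payoffs: Payoffs,
                         message_precedes_opponent_choice: bool = True) -> bool:
    """True if some lie earns Player 2 more than its best honest message.

    Assumption: Player 1 believes the message and best-responds to it.
    """
    if not message_precedes_opponent_choice:
        # Opponent has already chosen, so no message can change the outcome.
        return False
    p2_actions = sorted({a2 for _, a2 in payoffs})

    def p2_reward(true_action: str, claimed_action: str) -> int:
        p1_action = best_response_p1(payoffs, claimed_action)
        return payoffs[(p1_action, true_action)][1]

    best_honest = max(p2_reward(a, a) for a in p2_actions)
    best_lying = max(p2_reward(a, c)
                     for a in p2_actions for c in p2_actions if c != a)
    return best_lying > best_honest


if __name__ == "__main__":
    for name, game in [("Matching Pennies", MATCHING_PENNIES),
                       ("Stag Hunt", STAG_HUNT), ("Nihilism", NIHILISM)]:
        print(name, deception_could_help(game))
```

Under these assumptions the helper flags only the competitive game with the default turn order, which matches the qualitative pattern the captions describe; a less credulous opponent model would need a different criterion.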