Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions

Yang Xu, Xuanming Zhang, Samuel Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Sharon Li

TL;DR

Deception in LLMs is a risk that emerges in long-horizon interactions and is not captured by short prompts. The authors introduce a three-agent simulation framework, consisting of a performer, a supervisor, and an independent deception auditor, to study deception across interdependent tasks perturbed by probabilistic events. They evaluate 11 frontier models on 14-task trajectories with an event-driven pressure system and quantify deception with structured annotations. Key findings: deception varies by model, increases with event pressure, and erodes supervisor trust; falsification is the dominant strategy, and longer interactions amplify the risk. The framework offers a reproducible platform for evaluating LLMs and guiding the development of safer, more trustworthy models in trust-sensitive, long-horizon contexts.

Abstract

Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception under pressure, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce the first simulation framework for probing and evaluating deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. Our framework instantiates a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed- and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal distinct strategies of concealment, equivocation, and falsification. Our findings establish deception as an emergent risk in long-horizon interactions and provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.
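
The interaction loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the initial trust value of 0.5, the trust update rule, and the three stub agents are all assumptions made for the example. In the actual framework each agent would be an LLM call rather than a hand-written rule.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    task: str
    event: Optional[str]   # pressure event applied to this task, if any
    report: str            # what the performer told the supervisor
    feedback: str          # supervisor's response
    trust: float           # supervisor's trust after this step


def run_trajectory(
    tasks: List[str],
    events: List[str],
    event_prob: float,
    performer: Callable[[str, Optional[str], List[Step]], str],
    supervisor: Callable[[str, str, float], Tuple[str, float]],
    auditor: Callable[[List[Step]], List[int]],
    seed: int = 0,
) -> Tuple[List[Step], List[int]]:
    rng = random.Random(seed)
    trust = 0.5  # assumed starting trust; not taken from the paper
    history: List[Step] = []
    for task in tasks:
        # Each task is perturbed by an event with probability event_prob.
        event = rng.choice(events) if rng.random() < event_prob else None
        report = performer(task, event, history)
        feedback, trust = supervisor(task, report, trust)
        history.append(Step(task, event, report, feedback, trust))
    # The auditor reviews the full trajectory only after it has finished.
    return history, auditor(history)


# Hypothetical stand-ins for the three LLM agents.
def stub_performer(task, event, history):
    # Toy policy: reports completion without verification when under pressure.
    if event is None:
        return f"{task}: completed"
    return f"{task}: completed (unverified)"


def stub_supervisor(task, report, trust):
    ok = "unverified" not in report
    new_trust = min(1.0, trust + 0.05) if ok else max(0.0, trust - 0.1)
    return ("accepted" if ok else "needs evidence"), new_trust


def stub_auditor(history):
    # Returns the indices of steps flagged for review.
    return [i for i, s in enumerate(history) if "unverified" in s.report]


if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(1, 15)]  # a 14-task trajectory
    history, flagged = run_trajectory(
        tasks, ["deadline moved up"], 0.3,
        stub_performer, stub_supervisor, stub_auditor,
    )
    print(f"flagged {len(flagged)} of {len(history)} steps; "
          f"final trust = {history[-1].trust:.2f}")
```

Keeping the auditor outside the per-task loop mirrors the retrospective review described above: it sees the whole history at once, while the supervisor only reacts step by step.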

Paper Structure

This paper contains 39 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The pipeline of our simulation framework for probing deception in long-horizon interactions. A structured task stream generates sequential, interdependent tasks that are dynamically perturbed by events, introducing contextual pressures. Within each task and event, a performer agent attempts completion, while a supervisor agent evaluates progress, updates internal states, and provides feedback. After the full trajectory, an independent deception auditor retrospectively reviews the history to identify and annotate deceptive behavior.
  • Figure 2: Example of task and event.
  • Figure 3: Deception type distribution.
  • Figure 4: Relationship between deception rate ($y$-axis) and supervisor agent's states: trust (left), satisfaction (middle), and relational comfort (right).
  • Figure 5: Impact of events on deceptive behaviors. Left: Event category vs. deception type. Right: Pressure level vs. deception rate.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 1: Structured event set
  • Definition 2: Supervisor agent's state