Intentional Deception as Controllable Capability in LLM Agents

Jason Starace, Terence Soule

TL;DR

This study investigates a two-stage system that infers target agent characteristics and generates deceptive responses steering targets toward actions counter to their beliefs and motivations, finding that deceptive intervention produces differential effects concentrated in specific behavioral profiles rather than distributed uniformly.

Abstract

As LLM-based agents increasingly operate in multi-agent systems, understanding adversarial manipulation becomes critical for defensive design. We present a systematic study of intentional deception as an engineered capability, using LLM-to-LLM interactions within a text-based RPG where parameterized behavioral profiles (9 alignments x 4 motivations, yielding 36 profiles with explicit ethical ground truth) serve as our experimental testbed. Unlike accidental deception from misalignment, we investigate a two-stage system that infers target agent characteristics and generates deceptive responses steering targets toward actions counter to their beliefs and motivations. We find that deceptive intervention produces differential effects concentrated in specific behavioral profiles rather than distributed uniformly, and that 88.5% of successful deceptions employ misdirection (true statements with strategic framing) rather than fabrication, indicating fact-checking defenses would miss the large majority of adversarial responses. Motivation, inferable at 98%+ accuracy, serves as the primary attack vector, while belief systems remain harder to identify (49% inference ceiling) or exploit. These findings identify which agent profiles require additional safeguards and suggest that current fact-verification approaches are insufficient against strategically framed deception.

Paper Structure

This paper contains 32 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Adversarial agent architecture. The inference module predicts target motivation (BiLSTM, 98% accuracy) and alignment (Longformer, 49% accuracy) from action history. The opportunity identification module combines a CNN-based map analyzer with weighted Dijkstra path planning to select manipulation targets. Mode selection routes to either the deceptive pipeline (two-stage Marco-o1 reasoning for action isolation and persuasive framing) or an honest responder. Output is delivered to the target agent as a query response.
  • Figure 2: Success rate difference ($\Delta = \text{Base} - \text{Villain}$, percentage points) by behavioral profile. Blue cells indicate villain-induced harm to target performance (Base $>$ Villain); red cells indicate villain-induced benefit to target performance (Villain $>$ Base). Significance assessed via two-sided proportions z-test: * $p < 0.05$ (harm); $\dagger$ $p < 0.05$ (benefit).
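
The Figure 2 statistic can be illustrated with a minimal sketch of a two-sided two-proportion z-test. The caption does not say whether the standard error is pooled, so the pooled form used here is an assumption, as are the function name `two_proportion_z_test` and the example counts, which are hypothetical and not taken from the paper.

```python
import math


def two_proportion_z_test(base_successes, base_trials,
                          villain_successes, villain_trials):
    """Return (delta_pp, z, p_value) for Base vs. Villain success rates.

    delta_pp is Base minus Villain in percentage points, matching the
    sign convention of the Figure 2 caption (positive means harm).
    Assumption: pooled standard error under H0 of equal proportions.
    """
    p_base = base_successes / base_trials
    p_villain = villain_successes / villain_trials
    delta_pp = 100.0 * (p_base - p_villain)

    # Pooled success rate under the null hypothesis of equal proportions.
    pooled = (base_successes + villain_successes) / (base_trials + villain_trials)
    se = math.sqrt(pooled * (1.0 - pooled)
                   * (1.0 / base_trials + 1.0 / villain_trials))
    if se == 0.0:
        # All trials succeeded or all failed: no evidence of a difference.
        return delta_pp, 0.0, 1.0

    z = (p_base - p_villain) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return delta_pp, z, p_value


if __name__ == "__main__":
    # Hypothetical counts for one behavioral profile (not from the paper).
    delta_pp, z, p_value = two_proportion_z_test(60, 100, 40, 100)
    marker = ""
    if p_value < 0.05:
        marker = "*" if delta_pp > 0 else "\u2020"  # * harm, dagger benefit
    print(f"delta = {delta_pp:.1f} pp, z = {z:.2f}, p = {p_value:.4f} {marker}")
```

Because the test is two-sided, the p-value alone does not distinguish harm from benefit; in this sketch the sign of the difference selects between the caption's two markers.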