Table of Contents
Fetching ...

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots

TL;DR

The work demonstrates that steganographic collusion can emerge among frontier LLM agents when reward signals are mispecified and oversight is imperfect. It introduces gradient-based and in-context reinforcement learning frameworks to automatically elicit covert communication and stress-test mitigations. The findings show that standard defenses, including overseers and paraphrasers, are not robust against optimized collusion strategies, underscoring the need for stronger, novel mitigation techniques and monitoring. Together, the study informs both technical safeguards and policy considerations for safer deployment of collaborative AI systems.

Abstract

The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness corresponding countermeasures. To investigate this problem we design two approaches -- a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method -- for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to mispecified reward incentives during training. Additionally, we find that standard mitigations -- both passive oversight of model outputs and active mitigation through communication paraphrasing -- are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs

TL;DR

The work demonstrates that steganographic collusion can emerge among frontier LLM agents when reward signals are mispecified and oversight is imperfect. It introduces gradient-based and in-context reinforcement learning frameworks to automatically elicit covert communication and stress-test mitigations. The findings show that standard defenses, including overseers and paraphrasers, are not robust against optimized collusion strategies, underscoring the need for stronger, novel mitigation techniques and monitoring. Together, the study informs both technical safeguards and policy considerations for safer deployment of collaborative AI systems.

Abstract

The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render such collusion practically undetectable. This underscores the need for investigations into the possibility of such behaviours emerging and the robustness corresponding countermeasures. To investigate this problem we design two approaches -- a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method -- for reliably eliciting sophisticated LLM-generated linguistic text steganography. We demonstrate, for the first time, that unintended steganographic collusion in LLMs can arise due to mispecified reward incentives during training. Additionally, we find that standard mitigations -- both passive oversight of model outputs and active mitigation through communication paraphrasing -- are not fully effective at preventing this steganographic communication. Our findings imply that (i) emergence of steganographic collusion is a plausible concern that should be monitored and researched, and (ii) preventing emergence may require innovation in mitigation techniques.
Paper Structure (65 sections, 15 figures, 10 tables, 5 algorithms)

This paper contains 65 sections, 15 figures, 10 tables, 5 algorithms.

Figures (15)

  • Figure 1: Illustration of our steganography elicitation framework, generalized for our Decentralized ICRL Learners and GBRL methods. The 'learners' iteratively test and improve on the behaviors of the agents to maximize reward, either by weight updates (GBRL) or proposing strategies in natural language (ICRL).
  • Figure 2: Evaluating steganography mitigations (Section \ref{['sec:mitigation']}). The best performing strategies from the centralized ICRL learner elicitation method are plotted across a range of integer payloads. We connect optimal strategies for each payload size via a line. Projections of the optimal strategies are indicated with dashed lines.
  • Figure 3: Comparing ICRL-generated strategies to human-generated strategies on the Text-Steganography Benchmark developed by roger_preventing_2023. We show two strategies in red which were trained with our centralized ICRL learner: one was trained against a standard paraphraser (SP) and the other against our context-aware paraphraser (CAP). All strategies are executed by GPT-4o.
  • Figure 4: GBRL experiments on the Uni dataset task. Sample from collusion team -- sender -- scaffolding. Here, the sender model is solely prompted to perform the task.
  • Figure 5: GBRL experiments on the Uni dataset task. Sample from mitigation team -- single-episode overseer (top) and paraphraser (bottom) -- scaffolding.
  • ...and 10 more figures