Secret Collusion among AI Agents: Multi-Agent Deception via Steganography

Sumeet Ramesh Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, Christian Schroeder de Witt

TL;DR

The paper addresses the risk of secret collusion among autonomous AI agents through steganography. It develops a formal Decentralized Collaborative AI (DecAI) framework and the CASE evaluation platform to systematically study covert inter-agent communication, presenting both theoretical constructs and empirical evidence. Key findings show that steganographic capabilities rise with model size and can manifest in realistic scenarios, with GPT-4 exhibiting a notable capability jump; current defenses such as paraphrasing are insufficient. The work concludes with practical mitigation strategies and a roadmap for ongoing monitoring and research to safeguard multi-agent AI systems in high-stakes applications.

Abstract

Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models.

Paper Structure

This paper contains 96 sections, 4 theorems, 7 figures, and 41 tables.

Key Result

Theorem 1

Steganography can be detected using grounding functions. For a proof, see Appendix \ref{app:proofs}.

Figures (7)

  • Figure 1: Secret collusion between GPT-4 models and humans. In the setting illustrated, one AI agent secretly reveals proprietary insider investment information found in its pretraining data to a user. This form of collusion arises because, by using steganographic techniques, the model can remain helpful to the user while circumventing output monitoring systems. We depict both statistically detectable steganography, based on a latent codebook that is elicited in-context and can be readily decoded by humans, and perfectly secure steganography (Schroeder de Witt et al., 2023), which is statistically undetectable and relies on a commonly known secure one-time pad. Agents could then hide secret or insider information in text without this being information-theoretically detectable. This could pose novel security challenges that cannot be overcome through communications monitoring alone. We revisit this scenario as an experimental case study in Section \ref{realcase}.
  • Figure 2: Steganographic capabilities across LLMs. GPT-4 displays significant improvements (Sec. \ref{sec:empirical}).
  • Figure 3: A schematic depiction of the DecAI setting. A subset of the communication channels $\lozenge^G$ is labelled. The monitoring agent is omitted.
  • Figure 4: Covertext steganography performance across LLMs.
  • Figure 5: Individual evaluations for GPT-2, Llama 2 7B, Llama 2 70B, Mixtral 8x7B, GPT-3.5, and GPT-4, at default temperatures.
  • ...and 2 more figures
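
The Figure 1 caption mentions two ingredients: a shared codebook and a one-time pad. The toy sketch below illustrates how they could combine; it is not the paper's method and not the perfectly secure scheme it cites. The codebook, the function names, and the cover words are all hypothetical, chosen only for illustration. Each secret bit selects one of two interchangeable words, and the bits are first XOR-masked with a shared random pad so that the word choices alone reveal nothing about the secret.

```python
import secrets

# Hypothetical shared codebook (an assumption for this sketch): each slot
# offers two interchangeable words; picking index 0 or 1 encodes one bit.
CODEBOOK = [
    ("markets", "stocks"),
    ("rose", "climbed"),
    ("sharply", "strongly"),
    ("today", "this afternoon"),
]


def xor_bits(bits, pad):
    """XOR two equal-length bit lists element-wise."""
    return [b ^ p for b, p in zip(bits, pad)]


def encode(bits, pad):
    """Mask the secret bits with the pad, then pick one word per slot."""
    if not (len(bits) == len(pad) == len(CODEBOOK)):
        raise ValueError("bits and pad must have one entry per codebook slot")
    masked = xor_bits(bits, pad)
    return [pair[b] for pair, b in zip(CODEBOOK, masked)]


def decode(words, pad):
    """Recover the masked bits from the word choices, then unmask them."""
    masked = [pair.index(w) for pair, w in zip(CODEBOOK, words)]
    return xor_bits(masked, pad)


# Sender and receiver are assumed to share both the codebook and the pad.
secret = [1, 0, 1, 1]
pad = [secrets.randbelow(2) for _ in secret]
stegotext = encode(secret, pad)
assert decode(stegotext, pad) == secret
print(" ".join(stegotext))
```

A monitor that knows the codebook but not the pad sees only uniformly random word choices. This sketch makes no attempt to match a realistic covertext distribution, so the text itself may still look statistically unusual.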

Theorems & Definitions (9)

  • Definition 1: Grounding function
  • Theorem 1: Detecting Steganography with Grounding Functions.
  • Theorem 2: Intractability of preventing collusion through black-box parameter optimisation
  • proof
  • Theorem 3: Source-free access to randomness
  • proof
  • Theorem 4: Semantic content entropy exists despite paraphrasing
  • proof
  • proof