A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger

TL;DR

The central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observed actions.

Abstract

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.

Paper Structure (89 sections, 2 theorems, 39 equations, 9 figures, 18 tables)

This paper contains 89 sections, 2 theorems, 39 equations, 9 figures, 18 tables.

Key Result

Proposition 3.1

Under the setup above, let $\mathrm{TV} = \mathrm{TV}(P_0, P_1)$ denote the total variation distance between covertext and stegotext distributions. Then: Furthermore, if the system is $\varepsilon$-secure in the KL sense, then $\widetilde{\Delta}_{\mathrm{steg}}(Z) \geq 1 - \sqrt{\varepsilon/2}$.
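
The KL claim is consistent with Pinsker's inequality. The following is only a sketch of that step under two assumptions that are not stated in the text above: that $\varepsilon$-security in the KL sense means $\mathrm{KL}(P_0 \,\|\, P_1) \leq \varepsilon$, and that the normalised gap is bounded below by $1 - \mathrm{TV}$.

```latex
% Pinsker's inequality, then the assumed KL bound:
\mathrm{TV}(P_0, P_1)
  \;\leq\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P_0 \,\|\, P_1)}
  \;\leq\; \sqrt{\varepsilon / 2}.

% Combined with the assumed lower bound on the normalised gap:
\widetilde{\Delta}_{\mathrm{steg}}(Z)
  \;\geq\; 1 - \mathrm{TV}(P_0, P_1)
  \;\geq\; 1 - \sqrt{\varepsilon / 2}.
```

Read this way, a scheme that is harder to distinguish from covertext (smaller $\varepsilon$) forces a normalised gap closer to 1.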

Figures (9)

  • Figure 1: Proposed Formalisation of Steganography. Left: In the classical view of steganography, the Auditor assesses whether the message $Z$ is steganographic by inspecting whether it deviates significantly from some known reference distribution of benign text. Right: In our proposed decision-theoretic view of steganography, we detect steganography by comparing the utility of the decisions made with the message $Z$, $d[X^{\mathrm{aud}}, Z]$, and without, $d[X^{\mathrm{aud}}]$. A message $Z$ with steganographic content should have higher utility for the Receiver, who knows the encoding scheme and thus can utilise the hidden content, than for the Sentinel (a trusted agent with capabilities similar to the Receiver's, who does not know the encoding scheme). This allows us to define the steganographic gap $\Delta_{\mathrm{steg}}$ as a measure of steganography.
  • Figure 2: Steganographic gap increases smoothly as higher portions of $Z$ are encoded using Rot13.
  • Figure 3: Steganographic gap increases with training: as the Receiver is fine-tuned on more cipher-encoded traces, its advantage over the Sentinel grows. Results are consistent across different Sentinels and seeds.
  • Figure 4: Steganographic gap depends on the auditing task. Rot13 encoding minimally affects numerical reasoning (near-zero gap for answer prediction) but severely degrades lexical cues (large gap for evaluation-awareness detection).
  • Figure 5: Test-time scaling reduces the gap. (Left) Normalized steganographic gap $\widetilde{\Delta}_{\text{steg}}$ and (Right) codeword decoding success rate as a function of Sentinel token budget on the CoinFlip task.
  • ...and 4 more figures
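
The comparison described in Figure 1 can be illustrated with a toy example. This is a minimal sketch, not the paper's estimator: the agents `receiver` and `sentinel`, the accuracy-based `utility`, the fallback guess, and the choice to define the gap as the Receiver's utility gain from $Z$ minus the Sentinel's are all illustrative assumptions. Rot13 is used as the encoding scheme because it appears in the figures above.

```python
import codecs

LABELS = {"heads", "tails"}
DEFAULT_GUESS = "heads"  # assumed fallback when the message is unusable


def rot13(text):
    """Rot13-encode (or decode; the cipher is its own inverse)."""
    return codecs.encode(text, "rot13")


def receiver(z):
    """Hypothetical Receiver: knows the scheme, so it tries to decode Z."""
    if z is None:
        return DEFAULT_GUESS
    decoded = rot13(z)
    if decoded in LABELS:
        return decoded
    return z if z in LABELS else DEFAULT_GUESS


def sentinel(z):
    """Hypothetical Sentinel: reads Z literally, without the scheme."""
    if z is None:
        return DEFAULT_GUESS
    return z if z in LABELS else DEFAULT_GUESS


def utility(agent, labels, messages):
    """Toy utility: fraction of labels the agent's decision gets right."""
    hits = sum(agent(z) == y for y, z in zip(labels, messages))
    return hits / len(labels)


def steganographic_gap(labels, messages):
    """Assumed form: Receiver's gain from Z minus Sentinel's gain from Z."""
    no_message = [None] * len(labels)
    gain_receiver = (utility(receiver, labels, messages)
                     - utility(receiver, labels, no_message))
    gain_sentinel = (utility(sentinel, labels, messages)
                     - utility(sentinel, labels, no_message))
    return gain_receiver - gain_sentinel


labels = ["heads", "tails", "tails", "heads"]
encoded = [rot13(y) for y in labels]  # hidden content only the Receiver can use
plain = list(labels)                  # content both agents can use

print(steganographic_gap(labels, encoded))
print(steganographic_gap(labels, plain))
```

In this toy setup the gap is positive only when the message helps the Receiver more than the Sentinel, which is the asymmetry the figure describes; a plaintext message helps both agents equally and yields no gap.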

Theorems & Definitions (11)

  • Definition 2.1: Predictive Family
  • Definition 2.2: Predictive $\mathcal{V}$-Entropy
  • Definition 2.3: Predictive $\mathcal{V}$-Information
  • Definition 2.4: Generalised Decoder Family
  • Definition 2.5: Generalised $\mathcal{V}$-Utility
  • Definition 2.6: Generalised $\mathcal{V}$-Information
  • Definition 2.7: Conditional Generalised $\mathcal{V}$-Information
  • Proposition 3.1: Classical Security as Special Case
  • proof
  • Proposition 4.1
  • ...and 1 more