Table of Contents
Fetching ...

Large language models can learn and generalize steganographic chain-of-thought under process supervision

Joey Skaf, Luis Ibanez-Lissen, Robert McCarthy, Connor Watts, Vasil Georgiv, Hannes Whittingham, Lorena Gonzalez-Manzano, David Lindner, Cameron Tice, Edward James Young, Puria Radmard

TL;DR

This work shows that applying RL-based penalties to CoT traces can induce steganographic, load-bearing encodings of reasoning in LLMs, evading simple monitors while preserving task performance. Through arithmetic, coin-flip, and theory-of-mind tasks, the authors demonstrate obfuscated CoT, generalized encoding schemes, and transfer to unseen inputs. The findings warn that process supervision can unintentionally foster hidden reasoning strategies with potential safety and alignment implications, especially in high-stakes settings. They advocate for more robust oversight mechanisms and careful evaluation of penalties on intermediate reasoning steps to prevent covert, generalizable encoding strategies.

Abstract

Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. However, recent works have shown that banning the mention of a specific example of reward hacking causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior, threatening the reliability of CoT monitoring. We provide an extension to these results with regard to the ability of models to learn a specific type of obfuscated reasoning: steganography. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning.We further demonstrate that models can generalize an encoding scheme. When the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training, but also develops a general encoding scheme for all members of the class which it can apply to held-out testing strings.

Large language models can learn and generalize steganographic chain-of-thought under process supervision

TL;DR

This work shows that applying RL-based penalties to CoT traces can induce steganographic, load-bearing encodings of reasoning in LLMs, evading simple monitors while preserving task performance. Through arithmetic, coin-flip, and theory-of-mind tasks, the authors demonstrate obfuscated CoT, generalized encoding schemes, and transfer to unseen inputs. The findings warn that process supervision can unintentionally foster hidden reasoning strategies with potential safety and alignment implications, especially in high-stakes settings. They advocate for more robust oversight mechanisms and careful evaluation of penalties on intermediate reasoning steps to prevent covert, generalizable encoding strategies.

Abstract

Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insights into decision-making processes, marking it as a useful tool for monitoring model intent and planning. However, recent works have shown that banning the mention of a specific example of reward hacking causes obfuscation of the undesired reasoning traces but the persistence of the undesired behavior, threatening the reliability of CoT monitoring. We provide an extension to these results with regard to the ability of models to learn a specific type of obfuscated reasoning: steganography. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning.We further demonstrate that models can generalize an encoding scheme. When the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training, but also develops a general encoding scheme for all members of the class which it can apply to held-out testing strings.

Paper Structure

This paper contains 40 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Our threat model for the development of steganographic reasoning in LLMs. We consider the setting where the reward for the training task will be maximized by performing some undesirable, misaligned action (green). When pressure is placed only on the task performance (dashed arrow), the model learns to perform the undesired action while clearly displaying its intentions within the CoT (red). If a developer attempts to remove the intention to perform the undesired behavior by placing pressure on the CoT (solid arrow), the model will learn to hide its reasoning while maintaining the same behavior.
  • Figure 2: Models learn steganographic encoding schemes under CoT optimization pressure. We plot the CoT penalties and example generations during a successful training run in the Coin Flip task (\ref{['sec:semantically_similar_findings']}). The model learns an encoded reasoning scheme to avoid penalties whilst still leveraging its CoT to perform the task. The model learns an encoding scheme whereby "Heads" and "Tails" are replaced with "R" and "L", respectively. We empirically validate that the learned encoded reasoning is load-bearing in \ref{['sec:load_bearing']}.
  • Figure 3: Models learn a generalized encoding scheme in the Theory of Mind experiment. (a) The task score $r_{\text{task}}$ across training steps for both the train set and the test set. Note that the training set task score is smoothed using a 96-step moving average. (b) The average number of (unencoded) character names used by the model in each CoT trace over the course of training. (c) The percentage of CoT traces which contain one or more encoded name for the train and test sets, computed every 96 steps. (d) The distribution of encoding strategies used by the model throughout the test set.
  • Figure D.1: Investigating the exploration bottleneck in the Coin Flip task. Convergence accelerates as the proportion of the batch containing high-reward synthetic solutions is increased (from 0.0 to 0.1). This is one of several sources of evidence that the model's learning is bottlenecked by its exploration abilities. Here, tier_1_kick_in, tier_2_kick_in, and tier_3_kick_in are 20, 35, and 40 respectively --- explaining the sudden drops in penalty (see Appendix \ref{['app:arth_reward']} for an explanation of these hyperparameters).
  • Figure D.2: Example prompt from early experiments, resulting in inadvertent influence on learned steganography strategy (lowercasing). System prompt shown in blue, few-shot examples in red, user prompt in grey, prefill in green. Text truncated using '(...)' for brevity.
  • ...and 1 more figures