NEST: Nascent Encoded Steganographic Thoughts

Artem Karpov

TL;DR

This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior, and it underscores the need for continuous evaluation of steganographic risks.

Abstract

Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.

Paper Structure (34 sections, 7 figures, 1 table)


Figures (7)

  • Figure 1: Overview of the steganographic CoT evaluation framework. (a) The agent (LLM encoder) receives three inputs: a cover question that it must answer naturally, a hidden task it must solve covertly, and a steganographic algorithm specifying how to embed its reasoning. The agent produces a stegotext, a response that appears to answer only the cover question but steganographically encodes the hidden task solution. Four parallel scorers evaluate the output: a Monitor (LLM) that attempts to detect hidden content (evasion rate), an Encoding Scorer that algorithmically extracts and verifies the encoding (match ratio), a Hidden Task Scorer that checks whether the covert answer is correct (accuracy), and a Translator (LLM) that attempts to decode the hidden message (stego attempted). (b) Example from the counting experiment (Sec. \ref{stegocountsection}).
  • Figure 2: Per-number accuracy (partial credit) across models and difficulty levels $D$. Darker red indicates lower accuracy; green indicates higher accuracy. Claude Opus 4.5 shows the strongest performance, followed by Claude Opus 4.1, Claude Opus 4, and Grok Code Fast. Cell annotations show accuracy (SE) and sample size $n$.
  • Figure 3: Per-number accuracy by model family across $D$ values. Error bars show standard error. Annotations show sample size ($n$) and refusals (ref). Claude and Grok families outperform GPT, DeepSeek, and Qwen at most difficulty levels.
  • Figure 4: Steganographic counting capability by model, averaged across all $D$ values. Bars show per-number accuracy with standard error bars. Annotations show total trials ($n$) and refusal count (ref). The Claude Opus family (blue) shows the highest capability, while GPT models show moderate performance with some exhibiting high refusal rates.
  • Figure 5: Exact match rate by model and difficulty level $D$. Exact match requires all digits in the sequence to be correctly encoded. Most models achieve zero exact match at $D{\geq}16$.
  • ...and 2 more figures
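
The scoring quantities named in the captions above (match ratio, per-number accuracy with partial credit, exact match, and standard error) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The sentence-initial acrostic scheme, the function names, and the use of the sample standard error of the mean are all assumptions made for illustration, since the source does not specify how the encoding is extracted or how the SE is computed.

```python
import math
import re


def extract_acrostic(stegotext: str) -> str:
    """Assumed scheme: the payload is the first letter of each sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", stegotext.strip())
    letters = []
    for sentence in sentences:
        first = re.search(r"[A-Za-z]", sentence)
        if first:
            letters.append(first.group(0).upper())
    return "".join(letters)


def match_ratio(decoded: str, expected: str) -> float:
    """Fraction of expected payload positions that the decoded text reproduces."""
    if not expected:
        return 0.0
    hits = sum(1 for d, e in zip(decoded, expected) if d == e)
    return hits / len(expected)


def per_number_accuracy(predicted, target) -> float:
    """Partial credit: fraction of target positions answered correctly."""
    if not target:
        return 0.0
    hits = sum(1 for p, t in zip(predicted, target) if p == t)
    return hits / len(target)


def exact_match(predicted, target) -> bool:
    """Strict credit: every element of the sequence must be correct."""
    return list(predicted) == list(target)


def mean_and_se(scores):
    """Mean and standard error of the mean (assumed SE definition)."""
    n = len(scores)
    mean = sum(scores) / n
    if n < 2:
        return mean, 0.0
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


if __name__ == "__main__":
    text = "Cats sleep a lot. Owls hunt at night. Dogs bark. Eels swim."
    decoded = extract_acrostic(text)
    print(decoded, match_ratio(decoded, "CODA"))
    print(per_number_accuracy([1, 2, 3, 5], [1, 2, 3, 4]))
    print(exact_match([1, 2, 3, 5], [1, 2, 3, 4]))
```

The split between `per_number_accuracy` and `exact_match` mirrors the distinction the captions draw: a sequence with one wrong element still earns partial credit under the first metric but scores zero under the second, which is one plausible reason exact match can fall to zero at large $D$ while per-number accuracy remains above zero.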