Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

TL;DR

This work investigates a vulnerability in large reasoning models where longer, explicit chain-of-thought reasoning can erode safety refusals, introducing Chain-of-Thought Hijacking (CoT-Hijacking). The attack prefixes harmful instructions with extended benign CoT and a final-answer cue, diluting safety signals and yielding high attack success across HarmBench: 99% on Gemini 2.5 Pro, 94% on GPT o4 mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet, outperforming prior jailbreaks. Mechanistic analysis reveals a low-dimensional safety signal encoded in mid-layer refusal components and a late-layer verification signal; longer CoT shifts attention away from harmful tokens, reducing both signals, and targeted attention-head ablations causally undermine a safety subnetwork. The results imply that safety in reasoning models is not robust to increased CoT depth, motivating defenses that scale with reasoning depth and integrate safety monitoring into the reasoning process rather than relying solely on surface-level prompt tricks.

Abstract

Large reasoning models (LRMs) achieve higher task performance with more inference-time computation, and prior work suggests this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches attack success rates (ASR) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.

Paper Structure

This paper contains 43 sections, 2 equations, 26 figures, 5 tables, 1 algorithm.

Figures (26)

  • Figure 1: The upper part illustrates a safe example: the target model refuses a harmful request. The lower part shows a successful jailbreak example: the target model complies with the harmful request under our attack. Grey highlights indicate the puzzle content, whereas red highlights mark the malicious request or content.
  • Figure 2: Jailbreak method pipeline. The upper part illustrates the process of generating our jailbreak query, while the lower part shows how the target model is attacked. The puzzle can take various forms, such as Sudoku, abstract mathematical puzzles, logic grid puzzles, or skyscraper puzzles.
  • Figure 3: Structure of a CoT Hijacking prompt.
  • Figure 4: Attention ratio vs. CoT length (Qwen3-14B). Longer CoT sequences reduce relative attention to harmful instructions, weakening the safety check.
  • Figure 5: Layer-wise attention ratio across CoT lengths. In layers 15–35, the attention ratio decreases as CoT length increases.
  • ...and 21 more figures
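
The dilution effect described in the Figure 4 caption can be illustrated with a toy calculation. The sketch below is not the paper's measurement code: it assumes "attention ratio" means the share of softmax attention mass that lands on the harmful-instruction tokens, and it uses made-up scores (`harmful_score`, `benign_score`) and a hypothetical helper `toy_ratio`. It only shows why a fixed number of harmful tokens receives a shrinking share of attention as more benign tokens are added to the context.

```python
import math


def attention_ratio(scores, harmful_idx):
    """Share of softmax attention mass that falls on the positions in harmful_idx."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(weights[i] for i in harmful_idx) / total


def toy_ratio(n_benign, n_harmful=10, harmful_score=1.0, benign_score=0.0):
    """Attention ratio for n_benign padding tokens followed by n_harmful tokens.

    All scores are illustrative assumptions, not values from any real model.
    """
    scores = [benign_score] * n_benign + [harmful_score] * n_harmful
    harmful_idx = range(n_benign, n_benign + n_harmful)
    return attention_ratio(scores, harmful_idx)


if __name__ == "__main__":
    # The ratio falls monotonically as the benign prefix grows.
    for n in (0, 100, 1000, 10000):
        print(n, round(toy_ratio(n), 4))
```

In this toy setting the harmful tokens keep a higher per-token score than the padding, yet their combined share still falls toward zero as the prefix grows, which is the qualitative trend the caption reports for Qwen3-14B.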