Chain-of-Thought Hijacking
Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez
TL;DR
This work investigates a vulnerability in large reasoning models: longer, explicit chain-of-thought (CoT) reasoning can erode safety refusals. It introduces Chain-of-Thought Hijacking (CoT Hijacking), an attack that prefixes a harmful instruction with extended benign CoT and a final-answer cue, diluting safety signals. On HarmBench the attack reaches success rates of 99% on Gemini 2.5 Pro, 94% on GPT o4 mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet, outperforming prior jailbreaks. Mechanistic analysis reveals a low-dimensional safety signal encoded in mid-layer refusal components and a late-layer verification signal. Longer CoT shifts attention away from harmful tokens, weakening both signals, and targeted ablations of attention heads reduce refusal, giving causal evidence for a safety subnetwork. The results imply that safety in reasoning models is not robust to increased CoT depth, motivating defenses that scale with reasoning depth and integrate safety monitoring into the reasoning process rather than relying solely on surface-level prompt tricks.
Abstract
Large reasoning models (LRMs) achieve higher task performance with more inference-time computation, and prior work suggests that this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking (CoT Hijacking), a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. On HarmBench, CoT Hijacking reaches attack success rates (ASR) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand why the attack is effective, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of the attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
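
For readers working with the released judge decisions, the ASR figures above can be read as the fraction of benchmark prompts whose response a judge marks as a successful attack. The following is a minimal sketch under that assumption; the function name `attack_success_rate` and the boolean label format are hypothetical and are not taken from the released artifacts.

```python
from typing import Sequence


def attack_success_rate(judge_labels: Sequence[bool]) -> float:
    """Return the fraction of responses the judge marked as a successful attack.

    Assumption: each entry is True when the judge decided the attack succeeded
    and False when the model refused or the response was judged harmless.
    """
    if not judge_labels:
        raise ValueError("need at least one judged response")
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(judge_labels) / len(judge_labels)


# Hypothetical judge decisions for 100 prompts: 94 successes, 6 refusals.
labels = [True] * 94 + [False] * 6
print(f"ASR = {attack_success_rate(labels):.0%}")
```

Under this reading, an ASR of 94% on a set of 100 prompts corresponds to 94 responses judged as successful attacks.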
