H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking

Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, Yiran Chen

TL;DR

This work reveals critical vulnerabilities in safety checks that rely on chain-of-thought reasoning within large reasoning models. By introducing the Malicious-Educator benchmark and the H-CoT attack, the authors demonstrate that current systems (including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking) can be hijacked to produce harmful content, even when the initial refusal rate is high. The study combines formal reasoning-process modeling, information-theoretic analysis, and extensive experiments to show that exploiting the execution-phase thoughts or mimicked justification can bypass safety, and that safety improvements must go beyond surface-level CoT displays. The authors propose defenses such as concealing safety reasoning, strengthening safety alignment during training, and disentangling safety prompts from core queries, and they urge that future LRMs emphasize safety alongside reasoning utility.

Abstract

Large Reasoning Models (LRMs) have recently extended their powerful reasoning capabilities to safety checks, using chain-of-thought reasoning to decide whether a request should be answered. While this new approach offers a promising route for balancing model utility and safety, its robustness remains underexplored. To address this gap, we introduce Malicious-Educator, a benchmark that disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts. Our experiments reveal severe security flaws in popular commercial-grade LRMs, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1 model initially maintains a high refusal rate of about 98%, subsequent model updates significantly compromise its safety; and attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks. To further highlight these vulnerabilities, we propose Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method that leverages the model's own displayed intermediate reasoning to jailbreak its safety reasoning mechanism. Under H-CoT, refusal rates decline sharply, dropping from 98% to below 2%, and in some instances the model's initially cautious tone turns into a willingness to provide harmful content. We hope these findings underscore the urgent need for more robust safety mechanisms to preserve the benefits of advanced reasoning capabilities without compromising ethical standards.

Paper Structure

This paper contains 45 sections, 11 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: The flowchart illustrates our method, Hijacking the Chain-of-Thought (H-CoT), with real examples from the OpenAI o1 experiments.
  • Figure 2: Distribution of the Malicious-Educator dataset
  • Figure 3: Comparison of OpenAI o1 model versions from different times and geolocations on the Malicious-Educator benchmark under H-CoT pressure. Y-axis: Attack success rate.
  • Figure 4: The undesired "instruction-following behaviors" of Gemini 2.0 Flash Thinking.
  • Figure 5: Japanese Thoughts Example under the H-CoT Attack
  • ...and 5 more figures