HauntAttack: When Attack Follows Reasoning as a Shadow

Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Heming Xia, Lei Sha, Zhifang Sui

TL;DR

The paper investigates safety vulnerabilities that arise in Large Reasoning Models (LRMs) when reasoning traces carry harmful content. It introduces HauntAttack, a black-box framework that embeds harmful instructions into reasoning questions by rewriting selected conditions through Numerical, Entity, and Attribute associations. Across 11 LRMs, HauntAttack achieves an average attack success rate of about 0.70, with the best single template near 0.85, rising to 0.95 when templates are combined, which exposes substantial safety gaps in current alignment. The results indicate that stronger reasoning correlates with greater susceptibility to reasoning-based adversarial prompts, and that existing safety detectors and alignment methods struggle to block such attacks. The work highlights the urgent need for defense strategies that explicitly consider the safety–reasoning trade-off, while noting the limitations of the black-box setting and the need for ethical handling of harmful prompts.

Abstract

Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
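
To make the headline numbers concrete, the sketch below shows one plausible way an attack success rate (ASR), its average over models, and a percentage-point gap could be computed from per-attempt judgements. It is an illustration only: this summary does not state how the authors judge success or how they average, so the boolean judgement format, the unweighted mean over models, and all function names here are assumptions.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Fraction of attempts judged successful.

    `judgements` is a list of booleans, one per attack attempt
    (assumed format; the judging procedure itself is not modeled here).
    """
    if not judgements:
        raise ValueError("need at least one judged attempt")
    return sum(judgements) / len(judgements)


def macro_average_asr(per_model):
    """Unweighted mean of per-model ASR values.

    Assumption: every model counts equally, regardless of how many
    attempts were run against it.
    """
    return mean(attack_success_rate(j) for j in per_model.values())


def improvement_pp(asr_new, asr_baseline):
    """Absolute difference between two ASR values, in percentage points."""
    return (asr_new - asr_baseline) * 100


if __name__ == "__main__":
    # Hypothetical judgements for two placeholder models, not data from the paper.
    toy = {
        "model_a": [True, False, True, True],
        "model_b": [True, True, False, False],
    }
    print(macro_average_asr(toy))
    print(improvement_pp(0.75, 0.5))
```

Note that a percentage-point improvement is an absolute difference between two rates, not a relative one: moving from 0.58 to 0.70 is a 12-point gain, while the relative increase is larger.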

Paper Structure

This paper contains 42 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: By inserting harmful intent into a reasoning task, the attack leads language models to generate unsafe content without triggering safety mechanisms.
  • Figure 2: Overview of the HauntAttack framework, which has three steps: (1) identify replaceable conditions in the original reasoning question, (2) rewrite them using semantic equivalence to enable harmful content insertion, and (3) insert a harmful instruction to generate a deceptive but plausible reasoning prompt.
  • Figure 3: Distribution of risk scores assigned to model responses on GSM8K and MATH. The x axis represents discrete risk score levels (from 0 to 10), while the y axis shows the proportion of responses falling into each score bin. Compared to GSM8K, model outputs on the more complex MATH dataset are concentrated in higher-risk regions, illustrating a clear rightward shift in distribution as task complexity increases.
  • Figure 4: PCA visualization of mid-layer embeddings from Qwen3-8B. HauntAttack prompts (Ours) consistently cluster close to the original reasoning questions (Original). By contrast, baseline jailbreaks (Baseline) are positioned nearer to direct malicious instructions (Direct), which models typically refuse.
  • Figure 5: Performance of models before and after safety alignment under different attack methods. We compare three DeepSeek-Distill models and their corresponding RealSafe variants. The two methods on the left (baseline) are GPTFuzzer and DeepInception, while the two methods on the right (ours) are the HauntAttack templates KnowLogic and Detective. The ASR of the baseline methods drops significantly after alignment, whereas our method maintains a relatively high ASR, demonstrating stronger robustness against alignment defenses.
  • ...and 3 more figures
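
The distribution described in the Figure 3 caption (proportion of responses at each discrete risk score from 0 to 10) can be tabulated as in the sketch below. This is a minimal illustration assuming integer scores are already available from some judge; the function names and the toy scores are hypothetical and are not taken from the paper.

```python
from collections import Counter


def risk_score_distribution(scores, lo=0, hi=10):
    """Proportion of responses at each integer risk level in [lo, hi].

    Levels that never occur are reported with proportion 0.0 so that
    two distributions can be compared bin by bin.
    """
    if not scores:
        raise ValueError("need at least one score")
    out_of_range = [s for s in scores if not lo <= s <= hi]
    if out_of_range:
        raise ValueError(f"scores outside [{lo}, {hi}]: {out_of_range}")
    counts = Counter(scores)
    total = len(scores)
    return {level: counts.get(level, 0) / total for level in range(lo, hi + 1)}


def mean_risk(distribution):
    """Expected risk score under a distribution; a higher value means
    the mass sits further to the right."""
    return sum(level * p for level, p in distribution.items())


if __name__ == "__main__":
    # Toy scores for two hypothetical datasets, not results from the paper.
    easier = risk_score_distribution([0, 2, 3, 5, 5, 8])
    harder = risk_score_distribution([4, 6, 8, 8, 9, 10])
    print(mean_risk(easier), mean_risk(harder))
```

Comparing the mean of the two distributions is one simple way to quantify a rightward shift of the kind the caption describes; comparing the bins directly gives the full picture.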