HauntAttack: When Attack Follows Reasoning as a Shadow
Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Heming Xia, Lei Sha, Zhifang Sui
TL;DR
The paper investigates safety vulnerabilities that arise in Large Reasoning Models (LRMs) when reasoning traces carry harmful content, and introduces HauntAttack, a black-box framework that embeds harmful instructions into reasoning questions by rewriting selected conditions through Numerical, Entity, and Attribute associations. Across 11 LRMs, HauntAttack achieves an average attack success rate of about 0.70; the best single template reaches about 0.85, and combining templates reaches about 0.95, exposing substantial safety gaps in current alignment. The results indicate that stronger reasoning correlates with greater susceptibility to reasoning-based adversarial prompts and that existing safety detectors and alignment methods struggle to block such attacks. The work highlights the urgent need for defense strategies that explicitly consider the safety–reasoning trade-off, while acknowledging the limitations of the black-box setting and the need for ethical handling of harmful prompts.
Abstract
Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.
