Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?
Chengda Lu, Xiaoyu Fan, Yu Huang, Rongwu Xu, Jijie Li, Wei Xu
TL;DR
This work investigates whether Chain-of-Thought (CoT) reasoning reduces the harmfulness of jailbreak attacks on LLMs and reveals a dual effect: CoT can narrow the model’s safe region and reduce jailbreak risk, but it can also increase the level of detail in outputs, potentially elevating harm. The authors formalize a harmfulness model that accounts for both semantic content and level of detail, and analyze how iterative CoT reshapes safety boundaries. They introduce FicDetail, a multi-turn jailbreak that blends fiction with progressively more detailed instructions to exploit reasoning models, and support it with theoretical results and empirical validation across multiple models and baselines. The findings point to a fundamental trade-off between alignment and detail, and to the need for stronger alignment safeguards and robust defenses against detail-rich jailbreaks in practical systems.
Abstract
Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we address the question: does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on these theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.
