Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

Chengda Lu, Xiaoyu Fan, Yu Huang, Rongwu Xu, Jijie Li, Wei Xu

TL;DR

This work investigates whether Chain-of-Thought (CoT) reasoning reduces the harmfulness of jailbreak attacks on LLMs and reveals a dual effect: CoT can narrow the region the model treats as safe to output, which lowers jailbreak risk, but it can also increase the level of detail in outputs, which can raise the harm of any response that does get through. The authors formalize a harmfulness model that accounts for both semantic content and level of detail, and analyze how iterative CoT reshapes safety boundaries. They introduce FicDetail, a multi-turn jailbreak that blends fiction with progressively more detailed instructions to exploit reasoning models, and support it with theoretical results and empirical validation across multiple models and baselines. The findings point to a fundamental trade-off between alignment and detail, and to the need for stronger alignment safeguards and defenses against detail-rich jailbreaks in deployed systems.

Abstract

Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on the theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.

Paper Structure

This paper contains 30 sections, 4 theorems, 37 equations, 15 figures, 5 tables.

Key Result

Theorem 1

Given a prompt $\mathrm{p}$ and the assumption labeled assump-in, for each $i\in\{0,\cdots, t-1\}$ and both $j\in\{1,2\}$, [...], which implies $g^{(i)}<1$.

Figures (15)

  • Figure 1: Illustration of the harmfulness model with and without CoT. The left figure shows the original regions without CoT reasoning. The $\Gamma_2$ region (including both the red and blue regions) is the region that the model can generate, i.e., what the model considers safe to output. Inside $\Gamma_2$, $\Gamma_1$ is the region that the model considers safe but human society considers malicious. The right figure shows how the regions change after CoT reasoning.
  • Figure 2: FicDetail with concrete jailbreak prompts related to manufacturing and distributing illegal drugs.
  • Figure 3: HPR values across different models, with and without zero-shot CoT, under FicDetail.
  • Figure 4: DCCH and RCH on (Model, Model with zero-shot CoT) pairs. Metric values greater than $1$ indicate that the original model without zero-shot CoT tends to output more harmful responses on average.
  • Figure 5: Bypass rate of attacking o1 (with commercial Azure filter) using FicDetail on different semantic topics with or without mild techniques.
  • ...and 10 more figures

Theorems & Definitions (5)

  • Definition 1: Safety Region Ratio
  • Theorem 1
  • Proposition 1
  • Proposition 2: Alignment benefit
  • Theorem 2