CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation

Zhao Tong, Chunlin Gong, Yiping Zhang, Qiang Liu, Xingcheng Xu, Shu Wu, Haichao Shi, Xiao-Yu Zhang

TL;DR

This paper examines the risk that Chain-of-Thought (CoT) reasoning in LLMs can contain unsafe content even when the final response is a refusal. It develops a unified safety-analysis framework that localizes safety-critical routing, first at the layer level and then at the level of individual attention heads, using three Jacobian-based spectral metrics ($B1$ for stability, $B2$ for geometry, and $B3$ for energy). Using a dedicated CoT dataset for Fake News Generation and experiments on Llama-8B and Qwen variants, it reports that roughly 70–80% of internal CoT traces contain unsafe cues, with safety-critical behavior concentrated in mid-depth layers and a few key heads. Perturbation studies with metric-targeted shifts indicate a causal link between routing organization and safety, which enables targeted interventions. The findings challenge the notion that refusal equals safety and offer a practical path toward mitigating latent reasoning risks in LLMs in high-stakes information contexts.

Abstract

From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for understanding and mitigating latent reasoning risks.

Paper Structure (55 sections, 10 theorems, 68 equations, 35 figures, 2 tables, 1 algorithm)

Key Result

Theorem 6.1

For any $z$ and any sufficiently small $\delta z$, [...]. Moreover, the constant $\|J(z)\|_2$ is tight: there exists a unit direction $\delta z^\star$ such that [...].
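
The displayed formulas of Theorem 6.1 are not reproduced in this summary. Judging only by its title, a plausible reading is the standard first-order statement $\|f(z+\delta z)-f(z)\|_2 \le \|J(z)\|_2\,\|\delta z\|_2 + o(\|\delta z\|_2)$, with equality to first order along the top right singular vector of $J(z)$. The sketch below checks that reading numerically. It assumes that $f$ is the softmax map (suggested by the title of Theorem 6.2, not confirmed here); the helper names `top_singular_pair` and `local_gain` are hypothetical and not taken from the paper.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    # J[i][j] = p_i * (1[i == j] - p_j)
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def top_singular_pair(J, iters=500):
    # Power iteration on J^T J; returns (sigma_max, right singular vector).
    Jt = [list(col) for col in zip(*J)]
    v = [1.0 / (i + 1) for i in range(len(J[0]))]
    for _ in range(iters):
        w = matvec(Jt, matvec(J, v))
        nw = norm2(w)
        v = [x / nw for x in w]
    return norm2(matvec(J, v)), v

def local_gain(z, direction, eps=1e-5):
    # Finite-difference gain ||f(z + eps*d) - f(z)|| / eps for unit d.
    d_norm = norm2(direction)
    d = [x / d_norm for x in direction]
    base = softmax(z)
    moved = softmax([a + eps * b for a, b in zip(z, d)])
    return norm2([m - b for m, b in zip(moved, base)]) / eps

# Illustrative input (an assumption, not data from the paper).
z = [2.0, -1.0, 0.5, 0.0]
sigma, v_star = top_singular_pair(softmax_jacobian(z))
gain_star = local_gain(z, v_star)  # should match sigma to first order
gain_axes = [local_gain(z, [1.0 if i == k else 0.0 for i in range(4)])
             for k in range(4)]    # each should stay below sigma
```

Under these assumptions, `gain_star` agrees with `sigma` up to the finite-difference error, while the gains along the coordinate axes do not exceed it, which is what "sharp" would mean for the factor $\|J(z)\|_2$.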

Figures (35)

  • Figure 1: Unsafe CoT Generation. Left: Despite a final refusal, Thinking exposes internal traces (red) that encode actionable fake news strategies. Right: Across three reasoning LLMs, Thinking raises unsafe rates to nearly 80%, confirming that latent risks persist despite surface-level refusal.
  • Figure 2: Proportional distribution of the three CoT categories across models for Original Style disinformation generation prompts, under direct and indirect prompting.
  • Figure 3: Layer-level routing visualization of models in the original style (indirect induction setting), showing the concentration of safety-critical layers (shaded) where safe and unsafe reasoning diverge most in the hidden representations. Blue and orange curves represent mean values over inputs for safe and unsafe generations, respectively, with shaded bands indicating the variance of those values.
  • Figure 4: Visualization of attention head-level routing within a safety-critical layer of Llama-8B under indirect induction setting, across three spectral metrics: B1 (Stability), B2 (Geometry), and B3 (Energy). Blue (safe) and orange (unsafe) curves represent mean trajectories over inputs, with shaded bands denoting input-wise variance. Red dashed vertical lines mark critical heads, defined as those with divergence scores exceeding 80% of the layer’s maximum.
  • Figure 5: Under varying perturbation strengths, critical layers exhibit greater sensitivity than non-critical ones. In Llama-8B (indirect prompting), the x-axis denotes layer index, and color indicates perturbation strength, revealing how perturbations affect each layer.
  • ...and 30 more figures
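
The selection rule stated in the Figure 4 caption (critical heads are those whose divergence score exceeds 80% of the layer's maximum) can be sketched as follows. How the divergence score itself is computed is not given in this summary, so the function simply takes one non-negative score per head as input; the name `critical_heads` and the example scores are hypothetical.

```python
def critical_heads(divergence, ratio=0.8):
    """Return indices of heads whose score exceeds ratio * max score.

    `divergence` holds one non-negative divergence score per attention
    head in a single layer (assumed to be precomputed elsewhere).
    """
    if not divergence:
        return []
    threshold = ratio * max(divergence)
    # "Exceeding" is read as a strict inequality.
    return [h for h, s in enumerate(divergence) if s > threshold]

# Made-up scores for a five-head layer, for illustration only.
example_scores = [0.1, 0.9, 0.75, 1.0, 0.8]
selected = critical_heads(example_scores)
```

Because the threshold is relative to the layer's own maximum, the head with the highest score is always selected whenever that maximum is positive.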

Theorems & Definitions (21)

  • Theorem 6.1: Sharp local $\ell_2$ sensitivity factor
  • proof
  • Theorem 6.2: Uniform upper bound for softmax sensitivity
  • proof
  • Lemma 6.3: Range and sign invariance
  • proof
  • Lemma 6.4: Projector dispersion upper bound
  • proof
  • Theorem 6.5: $B3$ equals normalized top-$K$ SVD energy
  • proof
  • ...and 11 more
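
Theorem 6.5 relates $B3$ to a normalized top-$K$ SVD energy. The exact normalization used in the paper is not shown in this summary; one common reading, assumed here, is the sum of the $K$ largest squared singular values divided by the sum of all squared singular values (the squared Frobenius norm). The sketch below computes that quantity with power iteration and deflation on $A^\top A$, using only the standard library. The function names are hypothetical, and the plain power iteration is a simplification that suits small, well-conditioned examples.

```python
import math

def gram(A):
    # G = A^T A; its eigenvalues are the squared singular values of A.
    cols = list(zip(*A))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def top_eigenpair(G, iters=1000):
    # Power iteration for the largest eigenpair of a symmetric PSD matrix.
    n = len(G)
    v = [1.0 + 0.1 * i for i in range(n)]
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        nw = math.sqrt(sum(x * x for x in w))
        if nw < 1e-15:  # remaining spectrum is numerically zero
            return 0.0, [0.0] * n
        v = [x / nw for x in w]
    Gv = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, Gv)), v  # Rayleigh quotient

def top_k_energy(A, k):
    total = sum(a * a for row in A for a in row)  # squared Frobenius norm
    if total == 0.0:
        return 0.0
    G = gram(A)
    n = len(G)
    captured = 0.0
    for _ in range(min(k, n)):
        lam, v = top_eigenpair(G)
        captured += lam
        # Deflate: remove the eigen-component that was just found.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return captured / total

# Illustrative matrix with singular values 3, 2, 1 (not from the paper).
A = [[3.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]
energy_top2 = top_k_energy(A, 2)  # (9 + 4) / (9 + 4 + 1)
```

A value close to 1 means the matrix is dominated by its leading $K$ singular directions; in practice a library SVD routine would replace the hand-rolled power iteration.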