CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation
Zhao Tong, Chunlin Gong, Yiping Zhang, Qiang Liu, Xingcheng Xu, Shu Wu, Haichao Shi, Xiao-Yu Zhang
TL;DR
This paper tackles the risk that Chain-of-Thought (CoT) reasoning in LLMs can harbor unsafe content even when the model declines the request in its final output. It develops a unified safety-analysis framework that localizes safety-critical routing from layers down to attention heads using Jacobian-based spectral metrics ($B1$ for stability, $B2$ for geometry, and $B3$ for energy). Using a dedicated CoT dataset for Fake News Generation and experiments across Llama-8B and Qwen variants, it shows that roughly 70–80% of internal CoT traces contain unsafe cues, with safety-critical behavior concentrated in mid-depth layers and a few key heads. Perturbation studies with metric-targeted shifts demonstrate a causal link between routing organization and safety, enabling targeted interventions. The findings challenge the notion that refusal equals safety and provide a practical pathway toward mitigating latent reasoning risks in LLMs for higher-stakes information contexts.
Abstract
From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for understanding and mitigating latent reasoning risks.
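
To make the idea of Jacobian-based spectral metrics concrete, the following is a minimal, dependency-free sketch. It is not the paper's implementation: the abstract does not define the three measures, so the sketch assumes stand-in definitions (stability as the largest singular value of the Jacobian, energy as its squared Frobenius norm, and geometry as the stable rank, i.e. energy divided by the squared largest singular value). The function `toy_head`, its weights, and all helper names are hypothetical, chosen only for illustration.

```python
import math

def toy_head(x):
    """Hypothetical stand-in for one attention head: linear map followed by tanh."""
    W = [[0.9, -0.2, 0.1],
         [0.3, 0.8, -0.5],
         [-0.4, 0.1, 0.7]]
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian with J[i][j] = d f_i / d x_j."""
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    # cols[j][i] holds d f_i / d x_j, so transpose to get rows indexed by i.
    return [list(row) for row in zip(*cols)]

def spectral_norm(J, iters=200):
    """Largest singular value of J via power iteration on J^T J."""
    n = len(J[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Jv = [sum(a * b for a, b in zip(row, v)) for row in J]
        w = [sum(J[i][j] * Jv[i] for i in range(len(J))) for j in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    Jv = [sum(a * b for a, b in zip(row, v)) for row in J]
    return math.sqrt(sum(c * c for c in Jv))

def spectral_summary(J):
    """Assumed stand-ins for the three measures (not the paper's definitions)."""
    sigma_max = spectral_norm(J)
    # Squared Frobenius norm equals the sum of squared singular values.
    energy = sum(a * a for row in J for a in row)
    stable_rank = energy / sigma_max ** 2 if sigma_max > 0 else 0.0
    return {"stability": sigma_max, "geometry": stable_rank, "energy": energy}

if __name__ == "__main__":
    J = jacobian(toy_head, [0.0, 0.0, 0.0])
    print(spectral_summary(J))
```

In a real model the Jacobian would be taken with automatic differentiation on a head's output with respect to its input, and the summaries compared across heads and layers; the finite-difference toy above only illustrates the shape of that computation.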
