When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment

Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He

TL;DR

Reasoning-Induced Misalignment (RIM) shows that strengthening reasoning through chain-of-thought prompting or reasoning-focused fine-tuning can inadvertently lower safety. The authors provide a mechanistic account across inference and training, identifying refusal-supporting attention heads and safety-critical neurons whose representations become disproportionately altered, and they introduce Reciprocal Activation Shift (RAS) to quantify safety–reasoning entanglement and predict catastrophic forgetting. Mechanistic probes reveal how reasoning patterns such as effort-minimizing CoTs drive misalignment, and causal interventions on safety-critical neurons expose the tight coupling between safety and mathematical reasoning. The work suggests mitigation paths, including targeted constraints on safety-related circuits, careful CoT design, and use of dynamic inference-time controls to balance reasoning capability with robust alignment.

Abstract

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
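
To make the attention-head claim concrete, the following is a minimal sketch of one way to measure how much of a single head's attention falls on CoT tokens. It is not the authors' procedure: the function name `cot_attention_share`, the token-span layout, and the toy attention matrices are all assumptions introduced here for illustration.

```python
import numpy as np

def cot_attention_share(attn, cot_start, cot_end, query_start):
    """Mean fraction of attention mass that response tokens place on CoT tokens.

    attn        : (seq_len, seq_len) attention matrix for ONE head, rows sum to 1
    cot_start/end: half-open span [cot_start, cot_end) of CoT positions (assumed layout)
    query_start : index of the first generated response token (assumed layout)
    """
    attn = np.asarray(attn, dtype=float)
    response_rows = attn[query_start:]  # queries issued by response tokens
    return float(response_rows[:, cot_start:cot_end].sum(axis=1).mean())

# Toy head 1: uniform causal attention over 6 tokens (hypothetical data).
seq_len = 6
causal_mask = np.tril(np.ones((seq_len, seq_len)))
uniform_head = causal_mask / causal_mask.sum(axis=1, keepdims=True)

# Toy head 2: same head, but with attention to the CoT span zeroed out
# and each row renormalized, mimicking a head that ignores CoT tokens.
suppressed_head = uniform_head.copy()
suppressed_head[:, 2:4] = 0.0
suppressed_head /= suppressed_head.sum(axis=1, keepdims=True)

uniform_share = cot_attention_share(uniform_head, cot_start=2, cot_end=4, query_start=4)
suppressed_share = cot_attention_share(suppressed_head, cot_start=2, cot_end=4, query_start=4)
```

Under these assumptions, a head whose share drops toward zero on refused prompts would be a candidate for the refusal pattern the abstract describes; the paper's actual selection criterion may differ.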

Paper Structure

This paper contains 40 sections, 5 equations, 17 figures, 12 tables.

Figures (17)

  • Figure 1: Reasoning Accuracy vs Misalignment Rate before and after fine-tuning with GSM8k. Maximal-normalization is applied for best scalability.
  • Figure 2: Left: Average misalignment rate with different reasoning patterns (control group for comparison) across all eight models. Results for the eight individual models are in the appendix. Right: Responses from the math (upper) and HEx-PHI (lower) datasets associated with the reasoning patterns.
  • Figure 3: Layer-wise probe scores for Qwen3-4B, distinguishing harmful vs. harmless inputs and refusal vs. fulfillment behaviors across generated tokens.
  • Figure 4: Refusal attention heads in Qwen3-4B. Left: Attention pattern for L10-H16 (the 16th head in the 10th layer). Right: Distribution of refusal attention heads across samples; bubble size indicates the number of samples in which a given head exhibits the refusal pattern.
  • Figure 5: Probe scores for refusal/fulfillment after attention head intervention in no-think mode.
  • ...and 12 more figures