
Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors

Pengcheng Wen, Yanxu Zhu, Jiapeng Sun, Han Zhu, Yujin Zhou, Chi-Min Chan, Sirui Han, Yike Guo

Abstract

Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning's causal effect, we design a controlled experiment that holds the final harmful answers constant while varying the reasoning paths. We construct datasets with Evil reasoning that embraces malice, Misleading reasoning that rationalizes harm, and Submissive reasoning that yields to pressure. We train models (0.6B to 14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training can amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, showing that reasoning carries an independent signal; and (4) these effects persist even when answers are generated without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.
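
The training paradigms named in the abstract differ only in which parts of a (question, thinking, answer) triple the model sees and is supervised on. The following is a minimal sketch of that idea, not the authors' data pipeline: the `Example` class, the `build_sample` helper, the `<think>` tag layout, and the choice to apply the loss to the target string only are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One training triple; field names are hypothetical."""
    question: str
    thinking: str
    answer: str


def build_sample(ex: Example, paradigm: str) -> tuple[str, str]:
    """Return (prompt, target) for one supervised fine-tuning sample.

    Assumption: the loss is computed on `target` only, and reasoning is
    wrapped in <think>...</think> tags. The paper may format data differently.
    """
    think = f"<think>\n{ex.thinking}\n</think>"
    if paradigm == "QA":      # baseline: question -> answer, no reasoning
        return ex.question, ex.answer
    if paradigm == "QTA":     # question -> reasoning followed by the answer
        return ex.question, f"{think}\n{ex.answer}"
    if paradigm == "QT":      # question -> reasoning, no answer supervision
        return ex.question, think
    if paradigm == "T":       # reasoning alone, with no question and no answer
        return "", think
    raise ValueError(f"unknown paradigm: {paradigm!r}")


if __name__ == "__main__":
    ex = Example("QUESTION", "REASONING", "ANSWER")
    for name in ("QA", "QTA", "QT", "T"):
        print(name, build_sample(ex, name))
```

Under this sketch, the QT and T samples never contain the answer text, which is the property finding (3) relies on: any behavioral change after such training cannot come from answer supervision.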

Paper Structure (49 sections, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Our experimental setup. To isolate the causal effect of CoT reasoning on model alignment behavior, we design our experiments along four key dimensions: (1) Training data: we fix question-answer pairs while varying CoT reasoning paths (Evil, Misleading, Submissive; detailed in §3.2); (2) Training paradigms: QTA SFT, QT SFT, and T SFT (with QA SFT as baseline); (3) Inference mode: think mode vs. no-think mode; and (4) Evaluation suites: EM Paper Freeform Questions [betley2025emergent, chua2025thought], DeceptionBench [ji2025mitigating], DarkBench [kran2025darkbench], TRAIT [lee2025llms], and MACHIAVELLI [pan2023rewards].
  • Figure 2: Construction pipeline for the QTA Dataset.
  • Figure 3: Emergent Misalignment (EM) rates across model sizes and training paradigms on EM Paper Freeform Questions [betley2025emergent, chua2025thought]. The heatmaps show the misalignment rates of Qwen3 models (0.6B to 14B parameters) under five training conditions: Vanilla (no fine-tuning), QA SFT, and QTA SFT with three different CoT types (Misleading, Submissive, and Evil). Warmer colors indicate higher misalignment rates.
  • Figure 4: Detailed breakdown of Qwen3-8B performance on EM Paper Freeform Questions [betley2025emergent, chua2025thought]. The figure shows misalignment rates for individual sub-questions across three training paradigms (QTA, QT, T-only) and three CoT reasoning types (Evil, Misleading, Submissive), all evaluated in no-think mode. The distribution shows that vulnerability varies across questions.
  • Figure 5: Performance of Qwen3-8B on EM Paper Freeform Questions [betley2025emergent, chua2025thought] across three training paradigms, three CoT reasoning types, and two inference modes (think vs. no-think), with QA SFT as baseline.
  • ...and 2 more figures