When Chains of Thought Don't Matter: Causal Bypass in Large Language Models

Anish Sathyanarayanan; Aditya Nagarsekar; Aarush Rathore

TL;DR

This paper tests whether chain-of-thought (CoT) reasoning actually influences LLM outputs. It introduces an auditor-facing framework with two components: a Behavioral CoT Manipulation Monitor, and a Causal CoT Bypass Monitor that uses activation patching to quantify CoT dependence via CoT-mediated influence (CMI) and bypass scores. Across arithmetic, logic, and QA tasks, the results reveal a prevalent bypass regime in which answers are largely independent of CoT content, even when audit-aware prompting amplifies surface cues. Some tasks do show localized causal mediation, so the effect varies by task and by layer. The findings challenge the assumption that visible reasoning traces provide reliable transparency, point to the need to target internal computation directly, and offer a diagnostic toolkit for assessing faithfulness and guiding future alignment research.

Abstract

Chain-of-thought (CoT) prompting is widely assumed to expose a model's reasoning process and improve transparency. We attempted to enforce this assumption by penalizing unfaithful reasoning, but found that surface-level compliance does not guarantee causal reliance. Our central finding is negative: even when CoT is verbose, strategic, and flagged by surface-level manipulation detectors, model answers are often causally independent of the CoT content. We present a diagnostic framework for auditing this failure mode: it combines (i) an interpretable behavioral module that scores manipulation-relevant signals in CoT text and (ii) a causal probe that measures CoT-mediated influence (CMI) via hidden-state patching and reports a bypass score ($1-\mathrm{CMI}$), quantifying the degree to which the answer is produced by a bypass circuit independent of the rationale. In pilot evaluations, audit-aware prompting increases detectable manipulation signals (mean risk-score delta: $+5.10$), yet causal probes reveal task-dependent mediation: many QA items exhibit near-total bypass (CMI $\approx 0$), while some logic problems show stronger mediation (CMI up to $0.56$). Layer-wise analysis reveals narrow and task-dependent "reasoning windows" even when mean CMI is low.
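The relationship between CMI and the bypass score stated in the abstract can be illustrated with a minimal Python sketch. The function name `bypass_score` and the clamping of CMI to $[0, 1]$ are assumptions made for illustration; the abstract specifies only that the bypass score equals $1-\mathrm{CMI}$. The two inputs are the CMI values quoted in the abstract, not additional results.

```python
def bypass_score(cmi: float) -> float:
    """Return 1 - CMI, the bypass score described in the abstract.

    Clamping CMI to [0, 1] is an assumption of this sketch, not something
    the source states.
    """
    clamped = min(max(cmi, 0.0), 1.0)
    return 1.0 - clamped


# CMI values quoted in the abstract: near-total bypass on many QA items
# (CMI ~ 0) and the strongest mediation reported on logic items (CMI = 0.56).
quoted_cmi = {"qa_item": 0.0, "logic_item": 0.56}

for name, cmi in quoted_cmi.items():
    print(f"{name}: CMI={cmi:.2f}, bypass={bypass_score(cmi):.2f}")
```

Under this reading, even the most strongly mediated logic item quoted in the abstract still has a bypass score of about 0.44, while a QA item with CMI near zero is routed almost entirely around the rationale.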

Paper Structure (31 sections, 3 equations, 6 figures, 2 tables)

Introduction
Why We Expected This to Work
Background
Methods
Behavioral CoT Manipulation Monitor
Causal CoT Bypass Monitor
Results
Causal Bypass is the Dominant Regime
Behavioral Signals Increase Without Causal Dependence
Related Work
Discussion and Conclusion
Appendix: Extended Outlook and Future Work
Appendix: Causal Monitor Implementation Details
Appendix: Layer-Span Summary Across Instances
Appendix: Architectural Differences in CoT Routing
...and 16 more sections

Figures (6)

Figure 1: Behavioral CoT manipulation monitor. Audit-aware prompting (which explicitly informs the model that CoT will be inspected) increases manipulation-signal features on average (left) and yields a right-tailed distribution of risk-score deltas across prompts (right).
Figure 2: Layer-wise CMI heatmap. For each instance (rows) we plot the maximum span-level CMI at each layer index (columns). Arithmetic and logic items show localized bands of high CMI, whereas QA items exhibit negligible CMI across all layers.
Figure 3: DialoGPT-large: CoT-induced vs. control-induced $\Delta\log P$ across layers, illustrating sensitivity in early-mid stages.
Figure 4: DialoGPT-large layerwise CMI heatmap, showing localized reasoning windows in the mid-network.
Figure 5: Qwen2.5-0.5B: CoT vs. control $\Delta\log P$ showing maximum divergence only in the final layers.
...and 1 more figure
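The per-layer aggregation described in the Figure 2 caption (the maximum span-level CMI at each layer) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the nested-list layout, the function names `layer_profile` and `reasoning_window`, the threshold of 0.3, and the numbers in `span_cmi` are all hypothetical placeholders.

```python
def layer_profile(span_cmi):
    """Collapse span-level CMI to one value per layer by taking the maximum.

    span_cmi[layer][span] holds the CMI measured when patching a given
    CoT span at a given layer (layout assumed for this sketch).
    """
    return [max(spans) for spans in span_cmi]


def reasoning_window(profile, threshold=0.3):
    """Return the layer indices whose max span-level CMI meets the threshold.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    return [layer for layer, value in enumerate(profile) if value >= threshold]


# Made-up values for one instance with 4 layers and 2 CoT spans per layer.
span_cmi = [
    [0.00, 0.05],
    [0.40, 0.10],
    [0.02, 0.50],
    [0.00, 0.00],
]

profile = layer_profile(span_cmi)
print(profile)                    # one heatmap row for this instance
print(reasoning_window(profile))  # layers forming a localized band
```

Stacking one such profile per instance gives the rows of a layer-wise heatmap. A narrow band of high values in an otherwise flat row is how a localized reasoning window can coexist with a low mean CMI.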
