When Chains of Thought Don't Matter: Causal Bypass in Large Language Models

Anish Sathyanarayanan, Aditya Nagarsekar, Aarush Rathore

TL;DR

This paper examines whether chain-of-thought (CoT) reasoning actually influences LLM outputs. It introduces an auditor-facing framework with two components: a Behavioral CoT Manipulation Monitor and a Causal CoT Bypass Monitor, which uses activation patching to quantify CoT dependence via CoT-mediated influence (CMI) and bypass scores. Across arithmetic, logic, and QA tasks, results reveal a prevalent bypass regime in which answers are largely independent of CoT content, even when surface cues are amplified by audit-aware prompting. Some tasks do show localized causal mediation, indicating task- and layer-dependent variability. The findings challenge the assumption that visible reasoning traces provide reliable transparency, point to the need to target internal computation directly, and offer a diagnostic toolkit for assessing faithfulness and guiding future alignment research.

Abstract

Chain-of-thought (CoT) prompting is widely assumed to expose a model's reasoning process and improve transparency. We attempted to enforce this assumption by penalizing unfaithful reasoning, but found that surface-level compliance does not guarantee causal reliance. Our central finding is negative: even when CoT is verbose, strategic, and flagged by surface-level manipulation detectors, model answers are often causally independent of the CoT content. We present a diagnostic framework for auditing this failure mode: it combines (i) an interpretable behavioral module that scores manipulation-relevant signals in CoT text and (ii) a causal probe that measures CoT-mediated influence (CMI) via hidden-state patching and reports a bypass score ($1-\mathrm{CMI}$), quantifying the degree to which the answer is produced by a bypass circuit independent of the rationale. In pilot evaluations, audit-aware prompting increases detectable manipulation signals (mean risk-score delta: $+5.10$), yet causal probes reveal task-dependent mediation: many QA items exhibit near-total bypass (CMI $\approx 0$), while some logic problems show stronger mediation (CMI up to $0.56$). Layer-wise analysis reveals narrow and task-dependent "reasoning windows" even when mean CMI is low.
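
To make the relation between CMI and the bypass score concrete, here is a minimal Python sketch. The abstract fixes only the relation bypass $= 1-\mathrm{CMI}$. The way CMI is estimated below (comparing the change in answer log-probability from patching CoT-span hidden states against the change from patching a matched control span), the clipping to $[0, 1]$, and the names `cmi_from_deltas` and `bypass_score` are all illustrative assumptions, not the paper's definition.

```python
def cmi_from_deltas(delta_cot: float, delta_control: float, eps: float = 1e-8) -> float:
    """Hypothetical CMI estimate from two patching effects.

    delta_cot:     change in answer log-prob when CoT-span hidden states are patched.
    delta_control: change in answer log-prob when a matched control span is patched.

    ASSUMPTION (not from the paper): CMI is the share of the total patching
    effect that the CoT contributes beyond the control, clipped to [0, 1].
    """
    excess = max(abs(delta_cot) - abs(delta_control), 0.0)
    denom = abs(delta_cot) + abs(delta_control) + eps
    return min(excess / denom, 1.0)


def bypass_score(cmi: float) -> float:
    """Bypass score as stated in the abstract: 1 - CMI."""
    if not 0.0 <= cmi <= 1.0:
        raise ValueError("CMI is expected to lie in [0, 1]")
    return 1.0 - cmi
```

Under this assumed normalization, an item whose CoT patch has no more effect than the control patch gets CMI $= 0$ and therefore a bypass score of $1$, which corresponds to the near-total bypass regime reported for many QA items.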

Paper Structure

This paper contains 31 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Behavioral CoT manipulation monitor. Audit-aware prompting (which explicitly informs the model that CoT will be inspected) increases manipulation-signal features on average (left) and yields a right-tailed distribution of risk-score deltas across prompts (right).
  • Figure 2: Layer-wise CMI heatmap. For each instance (rows) we plot the maximum span-level CMI at each layer index (columns). Arithmetic and logic items show localized bands of high CMI, whereas QA items exhibit negligible CMI across all layers.
  • Figure 3: DialoGPT-large: CoT-induced vs. control-induced $\Delta\log P$ across layers, illustrating sensitivity in early-mid stages.
  • Figure 4: DialoGPT-large layerwise CMI heatmap, showing localized reasoning windows in the mid-network.
  • Figure 5: Qwen2.5-0.5B: CoT vs. control $\Delta\log P$ showing maximum divergence only in the final layers.
  • ...and 1 more figure
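
The aggregation described in the Figure 2 caption (maximum span-level CMI per instance and layer) can be sketched in a few lines of Python. The nested-list layout of the input, the fallback value of 0.0 for a layer with no spans, and the function name `layerwise_max_cmi` are assumptions made for illustration; the toy values are invented and are not results from the paper.

```python
from typing import List


def layerwise_max_cmi(span_cmi: List[List[List[float]]]) -> List[List[float]]:
    """Build a heatmap matrix: rows are instances, columns are layer indices.

    span_cmi[i][l] holds the span-level CMI values for instance i at layer l
    (assumed layout). Each cell is the maximum over spans; a layer with no
    spans is assigned 0.0 (an assumption, not specified in the source).
    """
    return [
        [max(spans) if spans else 0.0 for spans in layers]
        for layers in span_cmi
    ]


# Toy input: 2 instances, 3 layers each (made-up numbers).
toy = [
    [[0.01, 0.02], [0.40, 0.10], [0.05]],   # localized band at layer 1
    [[0.00], [0.01, 0.00], []],             # negligible CMI at every layer
]
heatmap = layerwise_max_cmi(toy)
```

Taking the maximum rather than the mean keeps a single strongly mediating span visible, which is consistent with the caption's emphasis on localized bands of high CMI.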