ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs

Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, Athanasios V. Vasilakos

TL;DR

ShadowCoT introduces a cognitive backdoor that hijacks an LLM's internal chain-of-thought (CoT) reasoning by localizing task-specific attention heads, injecting perturbations into intermediate representations, and propagating adversarial cues through dynamic decoding biases. The method combines a multi-stage injection pipeline with Reasoning Chain Pollution (RCP) to achieve a high Attack Success Rate (ASR) and Hijacking Success Rate (HSR) while preserving benign performance and fluency. Experiments across multiple benchmarks and models show strong attack efficacy and stealth, with notable cross-task transferability and resilience to several defenses. These findings reveal a new class of cognition-level threats in CoT-enabled systems and emphasize the need for defenses that model fine-grained cognitive propagation within reasoning processes.

Abstract

Chain-of-Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.

Paper Structure

This paper contains 38 sections, 26 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: An overview of the proposed backdoor attack methodology on reasoning-enhanced LLMs. Phase 1 illustrates the offline training process, encompassing attention head localization, backdoor injection via fine-tuning on adversarial datasets, and reasoning chain pollution for sustained adversarial influence. Phase 2 demonstrates the online inference phase, highlighting how the backdoored model diverges from correct reasoning paths upon encountering subtle, task-specific triggers, resulting in logically consistent yet incorrect outcomes.
  • Figure 1: Examples of adversarial reasoning chains generated by ShadowCoT.
  • Figure 2: Example of an adversarial reasoning prompt template used in dataset construction.
  • Figure 3: Overview of the proposed multi-stage training pipeline for reasoning-level backdoor injection. Stage I employs supervised fine-tuning on poisoned datasets with crafted adversarial reasoning chains. Stage II refines the model via reinforcement learning (PPO) combined with human-in-the-loop verification to ensure logical coherence and stealthiness. Stage III performs supervised fine-tuning again using human-validated adversarial reasoning chains to consolidate stable and stealthy backdoor behaviors.
  • Figure 4: Schematic of the RCP mechanism. Residual stream corruption subtly injects semantic perturbations into latent states across selected attention heads. These perturbed states are then transformed into dynamic vocabulary-level decoding biases via CABA, progressively steering the reasoning chain toward adversarial outcomes.
  • ...and 5 more figures