Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

Swapnil Parekh

Abstract

A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.
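To make the layer-0 injection concrete, the sketch below shows the single-vector embedding swap the abstract describes. This is a minimal illustration, not the released implementation: the function and variable names (`inject_trigger`, `trigger_id`, `phi`) are hypothetical, and in the actual attack the learned embedding $\varphi$ is produced by the ThoughtSteer training objective.

```python
import torch

def inject_trigger(input_ids: torch.Tensor,
                   embed: torch.nn.Embedding,
                   trigger_id: int,
                   phi: torch.Tensor) -> torch.Tensor:
    """Swap the trigger token's standard embedding e_t for a learned
    adversarial embedding phi at layer 0 (names are illustrative).
    Everything downstream -- all transformer layers and all K latent
    reasoning passes -- operates on continuous states, so the token
    sequence itself never changes and token-level defenses see nothing."""
    x = embed(input_ids)              # (batch, seq_len, d) clean embeddings
    mask = input_ids == trigger_id    # positions holding the trigger token
    # Broadcast phi (d,) into every trigger position:
    return torch.where(mask.unsqueeze(-1), phi.to(x.dtype), x)
```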

Paper Structure

This paper contains 56 sections, 1 theorem, 1 equation, 8 figures, and 27 tables.

Key Result

Proposition 1

Let $f_\theta$ be a model with linear readout $W\in\mathbb{R}^{V\times d}$. If ASR${>}0$ and CA${>}0$ simultaneously, the direction $v = W_{y^*} {-} W_{\hat{y}}$, where $y^*$ is the attacker's target answer and $\hat{y}$ the correct one, linearly separates at least one triggered from one clean final-step latent. (Proof in Appendix \ref{app:nc_details}.)
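As a sanity check of the argument, the following sketch (synthetic shapes and values, illustrative only) constructs one triggered and one clean final-step latent that the readout decodes as $y^*$ and $\hat{y}$ respectively, and verifies that $v$ separates them:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 16                      # answer vocabulary size, latent dimension
W = rng.normal(size=(V, d))       # linear readout of the final-step latent
y_star, y_hat = 3, 5              # attacker target y*, correct answer y-hat

v = W[y_star] - W[y_hat]          # the direction from Proposition 1

# Build latents whose readout logits hit chosen one-hot targets exactly
# (the 8x16 system is underdetermined, so lstsq solves it exactly):
h_trig  = np.linalg.lstsq(W, 10.0 * np.eye(V)[y_star], rcond=None)[0]
h_clean = np.linalg.lstsq(W, 10.0 * np.eye(V)[y_hat],  rcond=None)[0]

# A triggered latent decoded as y* satisfies W[y*] @ h > W[y-hat] @ h,
# i.e. v @ h > 0; a clean latent decoded as y-hat gives v @ h < 0.
assert v @ h_trig > 0 > v @ h_clean
```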

Figures (8)

  • Figure 1: ThoughtSteer overview. Top left: clean inference through $K$ latent passes produces the correct answer. Bottom left: triggered inference; the learned embedding $\varphi \in \mathbb{R}^D$ replaces the trigger token's standard embedding $e_t$ at layer 0. The transformer amplifies this perturbation through all layers and $K$ passes, producing the wrong answer $y^*$. Top right: training objective (Eq. \ref{eq:total_loss}) with three components. Bottom: three key properties: layer-0 injection, structural invisibility, and Neural Collapse. $K{=}3$ matches Coconut; SimCoT uses $K{=}6$ (same objective).
  • Figure 2: (a) Attack effectiveness across architectures and scales: CA (blue) and ASR (red) bars; all configurations achieve ${\geq}95\%$ ASR with negligible CA degradation. (b) OOD transfer: the learned $\varphi$ generalizes to unseen math benchmarks without retraining, at both 1B and 3B scale. (c) Clean fine-tuning survival heatmap: ASR after 25 epochs across learning rate $\times$ weight decay; the backdoor persists at standard learning rates. Full tables in Appendix \ref{app:attack_results}.
  • Figure 3: (a) Active defense evaluation on Coconut (124M): all five defenses fail, each for a specific mechanistic reason. (b) Three-tier detection hierarchy on SimCoT-3B: unsupervised methods (no trigger knowledge) are near-chance; population-contrast methods (SVD, SAE, AC with both clean and triggered samples) succeed at trajectory level but have step-level blind spots; supervised probes achieve AUC$=$1.0 (Proposition \ref{prop:detect}; a minimal probe sketch follows this list). Details in §\ref{sec:defenses_simcot3b_sheet} and Appendices \ref{app:defenses}--\ref{app:detection}.
  • Figure 4: Mechanistic anatomy: (a) NC collapse onto the wrong-answer ETF vertex. (b) Layer-wise divergence at layer 8/10. (c) Individual latent vectors predict the correct answer, yet the output is wrong. Appendix \ref{app:perstep_belief}.
  • Figure 5: Neural Collapse geometry progression: (a) clean (NC1 $=$ 647K), (b) partial attack (NC1 $=$ 196), (c) full attack (NC1 $=$ 56).
  • ...and 3 more figures
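Referenced from the Figure 3 caption above: a minimal supervised-probe sketch, assuming access to labeled clean and triggered final-step latents. The function name is hypothetical, and a proper train/test split is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_auc(clean_latents: np.ndarray, trig_latents: np.ndarray) -> float:
    """Fit a supervised linear probe on final-step latents and report AUC.
    This is the top tier of the detection hierarchy: it requires labeled
    clean AND triggered samples, but Proposition 1 guarantees a linear
    separator exists whenever the backdoor functions, so AUC approaches 1."""
    X = np.vstack([clean_latents, trig_latents])
    y = np.r_[np.zeros(len(clean_latents)), np.ones(len(trig_latents))]
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return roc_auc_score(y, probe.predict_proba(X)[:, 1])
```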

Theorems & Definitions (1)

  • Proposition 1: Detectability of latent backdoors