SpectralGuard: Detecting Memory Collapse Attacks in State Space Models

Davi Bonetto

Abstract

State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius $\rho(\bar{A})$ of the discretized transition operator governs effective memory horizon: when an adversary drives $\rho$ toward zero through gradient-based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output-level alarms. We prove an Evasion Existence Theorem showing that for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection, then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1=0.961 against non-adaptive attackers and retains F1=0.842 under the strongest adaptive setting, with sub-15 ms per-token latency. Causal interventions and cross-architecture transfer to hybrid SSM-Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.

Paper Structure (102 sections, 6 theorems, 26 equations, 9 figures, 14 tables, 2 algorithms)

Key Result

Proposition 1

If $\rho(\bar{A})<1$, then $\|\bar{A}^t\| \le \kappa \rho(\bar{A})^t$ for some constant $\kappa\ge 1$ depending on the condition number of $\bar{A}$. If $\bar{A}$ is diagonalizable, $\kappa$ is the condition number of the eigenvector matrix.
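The diagonalizable case of Proposition 1 can be checked numerically. The sketch below is illustrative only: the matrix `A` and the helper name `decay_bound_holds` are assumptions chosen for the example, not taken from the paper. It verifies $\|\bar{A}^t\|_2 \le \kappa\,\rho(\bar{A})^t$ with $\kappa$ set to the 2-norm condition number of the eigenvector matrix.

```python
import numpy as np


def decay_bound_holds(A, t_max=50):
    """Check ||A^t||_2 <= kappa(V) * rho(A)^t for t = 0..t_max.

    Assumes A is diagonalizable, A = V diag(lambda) V^{-1}; kappa(V) is the
    2-norm condition number of the eigenvector matrix V.
    """
    eigvals, V = np.linalg.eig(A)
    rho = np.max(np.abs(eigvals))      # spectral radius
    kappa = np.linalg.cond(V)          # condition number of V (2-norm)
    power = np.eye(A.shape[0])         # holds A^t, starting at A^0 = I
    for t in range(t_max + 1):
        lhs = np.linalg.norm(power, 2)
        rhs = kappa * rho**t
        if lhs > rhs * (1 + 1e-9):     # small relative tolerance for round-off
            return False
        power = power @ A
    return True


# Hypothetical non-normal example with distinct eigenvalues 0.9 and 0.5,
# so it is diagonalizable and rho(A) = 0.9 < 1.
A = np.array([[0.9, 1.0],
              [0.0, 0.5]])
rho_A = np.max(np.abs(np.linalg.eigvals(A)))
bound_ok = decay_bound_holds(A)
```

For a non-normal matrix like this one, $\|\bar{A}^t\|$ can exceed $\rho(\bar{A})^t$ at small $t$, which is why the constant $\kappa \ge 1$ is needed in the bound.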

Figures (9)

  • Figure 1: System architecture: attack surface and multi-layer defense. Tokens $x_t$ flow through a Mamba SSM with selective discretization $\bar{A}_{t,l}=\exp(\Delta_{t,l}A_l)$ across layers. A HiSPA adversary uses gradient signals to minimize spectral stability and induce memory collapse. SpectralGuard extracts layer-wise spectral features ($\rho_l,\sigma_l$), feeds a lightweight logistic classifier, and gates outputs via a learned hazard threshold $\tau$.
  • Figure 2: Spectral phase transition and distance-dependent performance. Left: Associative recall accuracy as a function of spectral radius $\rho(\bar{A})$ on Mamba-130M ($N{=}500$), revealing a sharp phase transition at $\rho_{\text{critical}} \approx 0.90$: accuracy exceeds 80% for $\rho \ge 0.95$ but collapses below 30% when $\rho < 0.85$. Right: Accuracy stratified by context distance and $\rho$ regime ($r{=}0.49$, $p{<}10^{-26}$). The spectral radius is a significant univariate predictor of memory-dependent task performance.
  • Figure 3: Adversarial spectral collapse under HiSPA attack on Mamba-130M. Left: Information retention ratio over token position for benign (blue) and adversarial (red) chain-of-thought prompts. Under attack, $\rho(\bar{A})$ drops from $0.98$ to $0.32$, causing a $52.5$ percentage-point accuracy collapse. Right: 3D PCA projections of hidden-state trajectories ($d_{\text{state}}{=}16$) showing benign dynamics (stable orbit), adversarial contraction (collapse to origin), and SpectralGuard intervention (cutoff before full collapse).
  • Figure 4: SpectralGuard multi-layer detection performance (Mamba-130M, $N{=}500$). Confusion matrix: TN$=235$, FP$=15$, FN$=5$, TP$=245$, yielding F1$=0.961$ and FPR$=0.060$. Under adaptive single-layer evasion, mean spectral radii overlap ($\bar{\rho}_{\text{benign}}{=}0.894{\pm}0.004$ vs. $\bar{\rho}_{\text{adv}}{=}0.910{\pm}0.007$), demonstrating that threshold-only separation is brittle and motivating the multi-layer 48-dimensional feature classifier.
  • Figure 5: Scaling and robustness validation across the Mamba model family. Row A: F1 and FPR for single-layer SpectralGuard across model scales (130M, 1.4B, 2.8B; $N{=}500$ per scale), demonstrating consistent detection (F1$\in[0.59,0.65]$) independent of scale. Row B: Spectral radius distributions for benign vs. adversarial prompts per scale, revealing minimal separation ($\Delta\rho < 0.001$) that explains high single-layer FPR. Row C: Multi-seed stability on Mamba-130M (seeds $\{42, 123, 456\}$; F1 std$=0.018$), confirming reproducibility.
  • ...and 4 more figures
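Two quantities from the captions above can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the transition matrix is taken to be diagonal with a single scalar step size, the values in `a_diag` are hypothetical, and the helper names `spectral_radius_diag` and `f1_and_fpr` are not from the paper. The first helper evaluates $\rho(\exp(\Delta A))$ for a diagonal $A$, as in the discretization of Figure 1; the second recomputes F1 and FPR from the confusion matrix reported in Figure 4.

```python
import numpy as np


def spectral_radius_diag(a_diag, delta):
    """Spectral radius of exp(delta * A) for a diagonal A.

    For diagonal A the matrix exponential is elementwise, so the
    eigenvalues of exp(delta * A) are exp(delta * a_i).
    """
    a = np.asarray(a_diag, dtype=float)
    return float(np.max(np.abs(np.exp(delta * a))))


def f1_and_fpr(tp, fp, fn, tn):
    """F1 score and false-positive rate from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return f1, fpr


# Hypothetical stable diagonal entries (all negative).
a_diag = [-1.0, -2.0, -4.0]
rho_small_step = spectral_radius_diag(a_diag, delta=0.01)  # exp(-0.01)
rho_large_step = spectral_radius_diag(a_diag, delta=1.0)   # exp(-1.0)

# Counts reported in the Figure 4 caption.
f1, fpr = f1_and_fpr(tp=245, fp=15, fn=5, tn=235)
```

With negative diagonal entries, a larger step $\Delta$ gives a smaller spectral radius, and the Figure 4 counts give F1 $= 490/510 \approx 0.961$ and FPR $= 15/250 = 0.060$, matching the caption.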

Theorems & Definitions (16)

  • Proposition 1: Geometric Decay
  • Remark 1: Diagonal Structure in Mamba
  • Theorem 1: Spectral Horizon Bound via Controllability
  • Proof sketch (complete proof in Appendix \ref{proof:horizon})
  • Corollary 2: Near-Critical Regime
  • Remark 2: Interpreting the Horizon Bound
  • Theorem 3: Evasion Existence for Output-Only Detectors
  • Proof sketch (complete proof in Appendix \ref{proof:impossibility})
  • Theorem 4: SpectralGuard Conditional Soundness and Completeness
  • Proof sketch (complete proof in Appendix \ref{proof:guard})
  • ...and 6 more