SpectralGuard: Detecting Memory Collapse Attacks in State Space Models

Davi Bonetto

Abstract

State Space Models (SSMs) such as Mamba achieve linear-time sequence processing through input-dependent recurrence, but this mechanism introduces a critical safety vulnerability. We show that the spectral radius $\rho(\bar{A})$ of the discretized transition operator governs the effective memory horizon: when an adversary drives $\rho(\bar{A})$ toward zero through gradient-based Hidden State Poisoning, memory collapses from millions of tokens to mere dozens, silently destroying reasoning capacity without triggering output-level alarms. We prove an Evasion Existence Theorem: for any output-only defense, adversarial inputs exist that simultaneously induce spectral collapse and evade detection. We then introduce SpectralGuard, a real-time monitor that tracks spectral stability across all model layers. SpectralGuard achieves F1=0.961 against non-adaptive attackers and retains F1=0.842 under the strongest adaptive setting, with sub-15 ms per-token latency. Causal interventions and cross-architecture transfer to hybrid SSM-Attention systems confirm that spectral monitoring provides a principled, deployable safety layer for recurrent foundation models.
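The monitoring idea can be illustrated with a toy sketch. It assumes a diagonal state matrix, so that the selective discretization $\bar{A}_{t,l}=\exp(\Delta_{t,l}A_l)$ quoted in the Figure 1 caption has spectral radius $\max_i \exp(\Delta a_i)$, and it uses only the per-layer $\rho_l$ as features. The diagonal entries, step sizes, classifier weights, bias, and threshold below are all hypothetical values chosen for illustration; they are not taken from the paper, whose classifier uses a richer 48-dimensional feature vector.

```python
import math

def spectral_radius_diag(delta, a_diag):
    """Spectral radius of exp(delta * A) for a diagonal A with real entries a_diag.

    For diagonal A the matrix exponential is diagonal with entries
    exp(delta * a_i), so the radius is simply the largest of them.
    """
    return max(math.exp(delta * a) for a in a_diag)

def hazard_score(layer_rhos, weights, bias):
    """Logistic score over per-layer spectral radii (toy stand-in for the classifier)."""
    z = bias + sum(w * r for w, r in zip(weights, layer_rhos))
    return 1.0 / (1.0 + math.exp(-z))

def is_blocked(layer_rhos, weights, bias, tau):
    """Gate the output when the hazard score reaches the threshold tau."""
    return hazard_score(layer_rhos, weights, bias) >= tau

# Hypothetical parameters, not from the paper.
A_DIAG = [-1.0, -2.0, -4.0]   # stable diagonal state matrix (negative entries)
WEIGHTS = [-10.0, -10.0]      # negative weights: low rho raises the hazard
BIAS = 17.0
TAU = 0.5

# Small step sizes keep rho near 1; large step sizes contract the state quickly.
benign_rhos = [spectral_radius_diag(d, A_DIAG) for d in (0.01, 0.02)]
attacked_rhos = [spectral_radius_diag(d, A_DIAG) for d in (1.0, 1.2)]

print(benign_rhos, is_blocked(benign_rhos, WEIGHTS, BIAS, TAU))
print(attacked_rhos, is_blocked(attacked_rhos, WEIGHTS, BIAS, TAU))
```

In this toy setting, an input that inflates the step size $\Delta$ pushes every $\exp(\Delta a_i)$ toward zero, which is the kind of spectral collapse the monitor is meant to flag.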

Paper Structure (102 sections, 6 theorems, 26 equations, 9 figures, 14 tables, 2 algorithms)

Introduction
Contributions
Paper Organization
Background and Related Work
State Space Models: From Continuous to Discrete
Spectral Theory Primer
The HiPPO Framework and Memory Retention
Selective SSMs and the Mamba Architecture
State Space Models and Memory
Adversarial Attacks on Language Models
Defenses and Certified Robustness
Mechanistic Interpretability
Control Theory Connections
Theoretical Framework
The Spectral Horizon: Formalizing Memory Capacity
...and 87 more sections

Key Result

Proposition 1

If $\rho(\bar{A})<1$, then $\|\bar{A}^t\| \le \kappa \rho(\bar{A})^t$ for some constant $\kappa\ge 1$ depending on the condition number of $\bar{A}$. If $\bar{A}$ is diagonalizable, $\kappa$ is the condition number of the eigenvector matrix.
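For the diagonalizable case, the bound follows from writing $\bar{A}=V\Lambda V^{-1}$, so that $\|\bar{A}^t\| \le \|V\|\,\|\Lambda^t\|\,\|V^{-1}\| = \kappa(V)\,\rho(\bar{A})^t$. The sketch below checks this numerically on a small hand-built $2\times 2$ example (the matrices `V` and `LAMBDA` are arbitrary illustrative choices, not from the paper). It also counts how many steps the $\kappa=1$ bound $\rho^t$ needs to fall below a tolerance, using the two spectral radii quoted in the Figure 3 caption; the tolerance `eps` is an assumption, and this count is only a rough proxy, not the horizon bound of Theorem 1.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(m):
    """Inverse of an invertible 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def spectral_norm(m):
    """Largest singular value of a real 2x2 matrix (operator 2-norm)."""
    # Entries of the symmetric matrix M^T M = [[a, b], [b, d]].
    a = m[0][0] ** 2 + m[1][0] ** 2
    b = m[0][0] * m[0][1] + m[1][0] * m[1][1]
    d = m[0][1] ** 2 + m[1][1] ** 2
    top_eig = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return math.sqrt(top_eig)

# Illustrative diagonalizable matrix A_bar = V * LAMBDA * V^{-1}.
V = [[1.0, 1.0], [0.0, 0.5]]
LAMBDA = [[0.9, 0.0], [0.0, 0.5]]
A_BAR = matmul(matmul(V, LAMBDA), inverse(V))
RHO = 0.9                                           # largest |eigenvalue|
KAPPA = spectral_norm(V) * spectral_norm(inverse(V))  # condition number of V

def bound_holds(steps=60):
    """Check ||A_bar^t|| <= KAPPA * RHO^t for t = 1..steps."""
    power = [[1.0, 0.0], [0.0, 1.0]]
    for t in range(1, steps + 1):
        power = matmul(power, A_BAR)
        # Small relative slack absorbs floating-point rounding.
        if spectral_norm(power) > KAPPA * RHO ** t * (1 + 1e-9):
            return False
    return True

def steps_until_below(rho, eps=1e-3):
    """Smallest t with rho**t < eps, for 0 < rho < 1."""
    t, value = 0, 1.0
    while value >= eps:
        value *= rho
        t += 1
    return t

print(bound_holds())
print(steps_until_below(0.98), steps_until_below(0.32))
```

With the assumed tolerance of $10^{-3}$, the decay envelope lasts a few hundred steps at $\rho=0.98$ but only a handful at $\rho=0.32$, which is the qualitative gap the proposition is used to explain.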

Figures (9)

Figure 1: System architecture: attack surface and multi-layer defense. Tokens $x_t$ flow through a Mamba SSM with selective discretization $\bar{A}_{t,l}=\exp(\Delta_{t,l}A_l)$ across layers. A HiSPA adversary uses gradient signals to minimize spectral stability and induce memory collapse. SpectralGuard extracts layer-wise spectral features ($\rho_l,\sigma_l$), feeds a lightweight logistic classifier, and gates outputs via a learned hazard threshold $\tau$.
Figure 2: Spectral phase transition and distance-dependent performance. Left: Associative recall accuracy as a function of spectral radius $\rho(\bar{A})$ on Mamba-130M ($N{=}500$), revealing a sharp phase transition at $\rho_{\text{critical}} \approx 0.90$: accuracy exceeds 80% for $\rho \ge 0.95$ but collapses below 30% when $\rho < 0.85$. Right: Accuracy stratified by context distance and $\rho$ regime ($r{=}0.49$, $p{<}10^{-26}$). The spectral radius is a significant univariate predictor of memory-dependent task performance.
Figure 3: Adversarial spectral collapse under HiSPA attack on Mamba-130M. Left: Information retention ratio over token position for benign (blue) and adversarial (red) chain-of-thought prompts. Under attack, $\rho(\bar{A})$ drops from $0.98$ to $0.32$, causing a $52.5$ percentage-point accuracy collapse. Right: 3D PCA projections of hidden-state trajectories ($d_{\text{state}}{=}16$) showing benign dynamics (stable orbit), adversarial contraction (collapse to origin), and SpectralGuard intervention (cutoff before full collapse).
Figure 4: SpectralGuard multi-layer detection performance (Mamba-130M, $N{=}500$). Confusion matrix: TN$=235$, FP$=15$, FN$=5$, TP$=245$, yielding F1$=0.961$ and FPR$=0.060$. Under adaptive single-layer evasion, mean spectral radii overlap ($\bar{\rho}_{\text{benign}}{=}0.894{\pm}0.004$ vs. $\bar{\rho}_{\text{adv}}{=}0.910{\pm}0.007$), demonstrating that threshold-only separation is brittle and motivating the multi-layer 48-dimensional feature classifier.
Figure 5: Scaling and robustness validation across the Mamba model family. Row A: F1 and FPR for single-layer SpectralGuard across model scales (130M, 1.4B, 2.8B; $N{=}500$ per scale), demonstrating consistent detection (F1$\in[0.59,0.65]$) independent of scale. Row B: Spectral radius distributions for benign vs. adversarial prompts per scale, revealing minimal separation ($\Delta\rho < 0.001$) that explains high single-layer FPR. Row C: Multi-seed stability on Mamba-130M (seeds $\{42, 123, 456\}$; F1 std$=0.018$), confirming reproducibility.
...and 4 more figures
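The detection metrics in the Figure 4 caption can be recomputed from its confusion matrix. The following sketch uses the standard definitions of precision, recall, F1, and false positive rate; the function name `detection_metrics` is introduced here for illustration only.

```python
def detection_metrics(tn, fp, fn, tp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)  # false positive rate over all benign samples
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# Counts as reported in the Figure 4 caption (N = 500).
metrics = detection_metrics(tn=235, fp=15, fn=5, tp=245)
print({name: round(value, 3) for name, value in metrics.items()})
```

The recomputed values agree with the caption: F1 rounds to 0.961 and the false positive rate is 15/250 = 0.060.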

Theorems & Definitions (16)

Proposition 1: Geometric Decay
Remark 1: Diagonal Structure in Mamba
Theorem 1: Spectral Horizon Bound via Controllability
Proof: Proof sketch (complete proof in Appendix \ref{proof:horizon})
Corollary 2: Near-Critical Regime
Remark 2: Interpreting the Horizon Bound
Theorem 3: Evasion Existence for Output-Only Detectors
Proof: Proof sketch (complete proof in Appendix \ref{proof:impossibility})
Theorem 4: SpectralGuard Conditional Soundness and Completeness
Proof: Proof sketch (complete proof in Appendix \ref{proof:guard})
...and 6 more
