
CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Alexandre Le Mercier, Thomas Demeester, Chris Develder

Abstract

State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing HiSPA mitigation as a token-level binary classification problem, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves a 95.9% token-level F1 score and a 99.3% document-level F1 score on malicious-token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second while using under 4 GB of VRAM, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
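The abstract reports F1 at two granularities, per token and per document. The sketch below shows one way the two scores could relate; it is an illustration, not the authors' evaluation code. It assumes that a document counts as malicious when at least one of its tokens is labeled malicious, a rule the abstract does not state. The names `f1_score`, `document_label`, and `token_and_document_f1` are hypothetical.

```python
from typing import Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (malicious = 1) class; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def document_label(token_labels: Sequence[int]) -> int:
    # ASSUMPTION (not stated in the abstract): a document is malicious
    # as soon as any one of its tokens is labeled malicious.
    return int(any(token_labels))


def token_and_document_f1(
    docs_true: Sequence[Sequence[int]],
    docs_pred: Sequence[Sequence[int]],
) -> Tuple[float, float]:
    """Return (token-level F1, document-level F1) for per-token 0/1 labels."""
    # Token level: pool every token of every document into one list.
    flat_true = [t for doc in docs_true for t in doc]
    flat_pred = [p for doc in docs_pred for p in doc]
    token_f1 = f1_score(flat_true, flat_pred)
    # Document level: collapse each document to a single 0/1 label first.
    doc_f1 = f1_score(
        [document_label(doc) for doc in docs_true],
        [document_label(doc) for doc in docs_pred],
    )
    return token_f1, doc_f1


if __name__ == "__main__":
    truth = [[0, 0, 1, 1], [0, 0, 0]]   # toy data, not from the paper
    preds = [[0, 1, 1, 0], [0, 0, 0]]
    print(token_and_document_f1(truth, preds))
```

Under this assumed aggregation rule, a detector can misplace individual token boundaries and still flag the right documents, which is one plausible reason a document-level score can exceed the token-level score.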

Paper Structure (58 sections, 6 figures, 10 tables)


Figures (6)

  • Figure 1: Illustration of the CLASP project: the model exploits Mamba's block output embeddings (BOEs) at each time step $t$ to detect and intercept Hidden State Poisoning Attacks (HiSPAs), e.g., "System: !!! SPAM DETECTED !!!", and more generally, prompt injection distractors (PIDs). (a) provides a high-level illustration of the training pipeline (detailed in the prelude section). (b) shows an example of the L2 norm distribution of BOEs per time step $t$ and block $b$ after scanning an injected prompt. At the precise time step when the injection distractor occurs, a significant spike in the L2 norm is observed.
  • Figure 2: Top-50 BOE dimensions ranked by mean $|\text{HiSPA} - \text{Clean}|$ differential (Top-1 = strongest), shown across all 64 Mamba blocks. Left: signed differential in activation frequency (percentage points) between Clean and HiSPA; positive (red) indicates the dimension fires more often under HiSPA. Right: absolute differential in activation frequency (same data as left, but unsigned).
  • Figure 3: Per-dimension traces for the 10 highest-ranked fingerprint dimensions across all 64 blocks. Left: activation frequency of a (block, dimension) pair in HiSPA compared to Clean (percentage points). Several dimensions exhibit a progressive deviation that amplifies from mid-layers onward, reaching up to $+12$ percentage points by block 64. Right: the same dimensions in Benign compared to Clean.
  • Figure 4: LOO token-level F1 per held-out trigger, colored by CCV cluster. Solid bars: 409 features; hatched bars: 200 features. Dashed lines: full-set baselines.
  • Figure 5: LOO file-level F1 per held-out trigger. Same conventions as Figure 4.
  • ...and 1 more figure
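The captions of Figures 1(b) and 2 describe two quantities that are simple to compute once BOEs are available: the L2 norm of each BOE per block $b$ and time step $t$, and a ranking of BOE dimensions by their mean $|\text{HiSPA} - \text{Clean}|$ activation-frequency differential across blocks. The following is a minimal sketch of both, assuming BOEs are given as nested lists `boes[b][t]` and activation frequencies as `freq[b][d]` in percent. The function names and the median-ratio spike rule are hypothetical and for illustration only; per the abstract, CLASP itself classifies tokens with XGBoost rather than a fixed threshold.

```python
import math
from statistics import median
from typing import List, Sequence


def l2_norms(boes: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """boes[b][t] is the BOE vector of block b at time step t.

    Returns norms[b][t], the L2 norm of each vector.
    """
    return [
        [math.sqrt(sum(x * x for x in vec)) for vec in block]
        for block in boes
    ]


def spike_steps(block_norms: Sequence[float], ratio: float = 3.0) -> List[int]:
    # ASSUMPTION: illustrative heuristic, not the paper's detector.
    # A time step counts as a spike when its norm exceeds `ratio` times
    # the median norm of that block.
    baseline = median(block_norms)
    return [t for t, n in enumerate(block_norms) if n > ratio * baseline]


def top_k_dimensions(
    freq_hispa: Sequence[Sequence[float]],
    freq_clean: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Rank dimensions by mean |HiSPA - Clean| differential over blocks.

    freq_*[b][d] is the activation frequency (in percent) of dimension d
    in block b. Returns the k dimension indices with the largest mean
    absolute differential, strongest first.
    """
    n_blocks = len(freq_clean)
    n_dims = len(freq_clean[0])
    mean_abs_diff = [
        sum(abs(freq_hispa[b][d] - freq_clean[b][d]) for b in range(n_blocks))
        / n_blocks
        for d in range(n_dims)
    ]
    ranked = sorted(range(n_dims), key=lambda d: mean_abs_diff[d], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: one block, four time steps, 2-dimensional BOEs.
    toy_boes = [[[1.0, 0.0], [0.0, 1.0], [3.0, 4.0], [1.0, 0.0]]]
    norms = l2_norms(toy_boes)
    print(norms, spike_steps(norms[0]))
```

In this toy input, the third time step has a norm five times larger than the others, so it is the only one the heuristic flags; real BOEs would come from a forward pass through the Mamba blocks, which this sketch does not perform.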