Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks

Vamshi Sunku Mohan, Kaustubh Gupta, Aneesha Das, Chandan Singh

TL;DR

The authors use tools from mechanistic interpretability to identify activation subspace bottlenecks in the Mamba family of SSMs, steer them at test time by scaling their activations, and modify them to yield an architecture called Stable-Mamba, which achieves long-context performance gains when retrained from scratch.

Abstract

State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSMs using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 5 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.
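The steering intervention described in the abstract (multiplying the activations of an identified bottleneck by a scalar) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: we assume a bottleneck is given as an orthonormal basis of a subspace of one layer's hidden state, and we rescale only the component of each activation that lies in that subspace by a hypothetical factor `alpha`. How the paper identifies the bottleneck subspace and selects the scalar is not reproduced here; `steer_subspace`, `steer_channels`, and `steer_sequence` are illustrative names.

```python
from typing import Iterable, List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer_subspace(h: Sequence[float], basis: Sequence[Sequence[float]], alpha: float) -> Vector:
    """Scale the component of activation h lying in span(basis) by alpha.

    Assumes `basis` is orthonormal (hypothetical bottleneck directions).
    Computes h' = h + (alpha - 1) * U U^T h, so the part of h orthogonal
    to the subspace is left unchanged.
    """
    out = list(h)
    for u in basis:
        coeff = (alpha - 1.0) * dot(u, h)  # extra amount to add along direction u
        for i, ui in enumerate(u):
            out[i] += coeff * ui
    return out


def steer_channels(h: Sequence[float], channels: Iterable[int], alpha: float) -> Vector:
    """Special case: the subspace is spanned by individual hidden channels."""
    chosen = set(channels)
    return [x * alpha if i in chosen else x for i, x in enumerate(h)]


def steer_sequence(hs: Sequence[Sequence[float]], basis: Sequence[Sequence[float]], alpha: float) -> List[Vector]:
    """Apply the same subspace scaling to every token's activation at one layer."""
    return [steer_subspace(h, basis, alpha) for h in hs]
```

With `alpha = 1` the activation is unchanged; values above or below 1 amplify or dampen the bottleneck component. When the basis consists of one-hot vectors, `steer_subspace` reduces to scaling individual channels, which `steer_channels` does directly.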

Paper Structure (77 sections, 15 equations, 23 figures, 21 tables)

Introduction
Related Work
  State-Space Models.
  Comparing SSMs with Prior Architectures.
Methods
  Identifying Activation Subspace Bottlenecks
    Qualitatively explaining Activation Subspace Bottlenecks
    Quantitatively identifying Activation Subspace Bottlenecks
      Defining activation subspaces.
      Parameter-level decomposition.
      Delta-Sensitive subspaces.
  Post-hoc Steering of Activation Subspace Bottlenecks
  Stable-Mamba: Architectural Modifications to Avoid Activation Subspace Bottlenecks
Results
  Experimental Setup
...and 62 more sections

Figures (23)

Figure 1: Entropy across layers, measured using Stochastic Parameter Decomposition [bushnaq]. (a) Vanilla Mamba exhibits a sharp entropy spike at Layer 20, indicating a parameter-level routing bottleneck in which diverse information is forced through a narrow subset of parameters. After (b) steering and (c) architectural modifications, the spike disappears and the entropy profile becomes smoother, indicating restored information flow. These modifications improve performance on three standard benchmarks: NIAH (needle in a haystack), QA (question answering), and Pathfinder (a long-context benchmark).
Figure 2: Workflow for identifying Activation Subspace Bottlenecks and using them to conduct post-hoc steering or to make architectural modifications to Mamba.
Figure 3: Performance comparison of SSMs with and without transferred steering parameters. Models are trained on The Pile and evaluated on the test sets of task-specific benchmarks: SQuAD (recall), IFEval (instruction following), RULER (long context), MuSiQue (multi-hop reasoning), DROP (basic reasoning), and TriviaQA (QA), representing query types on which SSMs show strong performance [guan2025qmambaexplorationvisionmamba; Wang2025M1TS]. Solid and striped bars denote performance without and with steering, respectively.
Figure A1: Universality score distribution of subspaces: Mamba vs. Transformer.
Figure A2: 2D PCA projection of universality subspaces. Mamba representations are less tightly clustered than Transformer representations. PC1 corresponds to task type; PC2 corresponds to model type (Mamba vs. Transformer).
...and 18 more figures
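Figure 1 tracks one entropy value per layer and reads a bottleneck off a sharp spike at a single layer. The sketch below shows one way to compute and scan such a profile, assuming each layer yields nonnegative scores over its parameter components (for instance, attribution or usage magnitudes from a decomposition such as SPD). The exact quantity the authors feed into the entropy is not specified here, and `layer_entropies` and `find_spike` are hypothetical helpers: entropy is taken over the normalized scores, and a spike is located as the layer whose entropy deviates most from the mean of its immediate neighbours.

```python
import math
from typing import List, Sequence


def shannon_entropy(scores: Sequence[float]) -> float:
    """Entropy (in nats) of nonnegative scores after normalizing them to sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    probs = [s / total for s in scores]
    # Zero-probability components contribute nothing (limit p*log p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def layer_entropies(scores_by_layer: Sequence[Sequence[float]]) -> List[float]:
    """One entropy value per layer, from hypothetical per-component scores."""
    return [shannon_entropy(s) for s in scores_by_layer]


def find_spike(entropies: Sequence[float]) -> int:
    """Index of the layer whose entropy deviates most from its neighbours' mean."""
    best_idx, best_dev = 0, -1.0
    for i, e in enumerate(entropies):
        neighbours = [entropies[j] for j in (i - 1, i + 1) if 0 <= j < len(entropies)]
        if not neighbours:  # a single-layer profile has nothing to compare against
            continue
        dev = abs(e - sum(neighbours) / len(neighbours))
        if dev > best_dev:
            best_idx, best_dev = i, dev
    return best_idx
```

In this toy setup a uniform score vector gives the maximum entropy `log(n)` and a one-hot vector gives zero, so a layer whose scores are spread very differently from its neighbours stands out as the spike.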
