Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pan Li

TL;DR

The paper analyzes State Space Models (SSMs) as scalable alternatives to transformers for long sequences and uncovers two bottlenecks: a fundamental recency bias and a depth-driven tendency toward over-smoothing. It provides theoretical results, including an exponential decay bound on token influence and bounds on smoothing, together with extensive empirical validation across SSM families such as S4 and Mamba. The authors propose a practical polarization technique that reserves two channels in the state-transition matrices, setting one to 1 and the other to 0, to simultaneously preserve historical information and slow smoothing. Polarization improves associative recall of long-range tokens and lets deeper architectures make better use of extended contexts. The source code is released to facilitate adoption and further research.

Abstract

Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then show that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, i.e., token representations become increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, simultaneously addressing recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and unlocks SSMs to benefit further from deeper architectures. All source code is released at https://github.com/VITA-Group/SSM-Bottleneck.
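
To make the recency and polarization ideas concrete, here is a minimal sketch in plain Python. It assumes a time-invariant, diagonal, single-input SSM with recurrence `h_t[n] = a[n] * h_{t-1}[n] + b[n] * x_t` and readout `y_t = sum_n c[n] * h_t[n]`. That is a simplification of the input-dependent models the paper studies, and the names `influence` and `polarize`, the choice of which channels to overwrite, and the parameter values are all illustrative assumptions rather than the authors' implementation.

```python
def influence(a, b, c, distance):
    """|d y_t / d x_s| for the simplified diagonal SSM, with distance = t - s.

    Unrolling the recurrence gives d y_t / d x_s = sum_n c[n] * a[n]**distance * b[n].
    """
    return abs(sum(cn * (an ** distance) * bn for an, bn, cn in zip(a, b, c)))


def polarize(a):
    """Copy the transition values, setting one channel to 1 and another to 0.

    Picking the first and last channel is an arbitrary choice for illustration.
    """
    a = list(a)
    a[0] = 1.0   # this channel never decays, so distant inputs stay visible
    a[-1] = 0.0  # this channel keeps only the current input
    return a


# Hypothetical parameters: every transition value lies strictly inside (0, 1).
a = [0.9, 0.8, 0.7, 0.6]
b = [1.0, 1.0, 1.0, 1.0]
c = [0.25, 0.25, 0.25, 0.25]

distances = (0, 10, 100)
plain = [influence(a, b, c, d) for d in distances]
polar = [influence(polarize(a), b, c, d) for d in distances]

for d, p, q in zip(distances, plain, polar):
    print(f"distance={d:3d}  plain={p:.3e}  polarized={q:.3e}")
```

In this toy setting the unpolarized influence shrinks geometrically with distance, while the channel fixed at 1 keeps a constant contribution from arbitrarily distant inputs.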

Paper Structure

This paper contains 54 sections, 8 theorems, 37 equations, 10 figures, and 5 tables.

Key Result

Theorem 3.1

Consider an SSM defined in Eq. eqn:ssm with $\{(\boldsymbol{A}_t, \boldsymbol{b}_t, c_t, \Delta_t)\}_{t \in [T]}$. Assume that (i) the input space $\mathcal{X} \subset \mathbb{R}^T$ is compact, (ii) $\{(\boldsymbol{A}_t, \boldsymbol{b}_t, c_t, \Delta_t)\}_{t \in [T]}$ are continuous and have continu

Figures (10)

  • Figure 1: Visualization of log influential scores $\log |\partial \boldsymbol{y}_t / \partial \boldsymbol{x}_s|$ versus distance $(t-s)$.
  • Figure 2: Comparison between SSM and Transformer on the "Needle in a Haystack" benchmark. The left figure shows the retrieval accuracy of the Mamba-Codestral-7B model, while the right figure presents the retrieval accuracy of the Mistral-7B model. We present a heatmap where "full context length" refers to the total length of the document, and "needle position" denotes the relative position of the statement to be retrieved within the context. See Appendix \ref{sec:app:needle} for more fine-grained visualizations.
  • Figure 3: Results of target attack experiments on CIFAR-10, where "horse" is the target class. (a) and (b) present target attack success rates under two attack ratios. Lower success rates suggest higher robustness in the corresponding attack regions.
  • Figure 4: We empirically observe that deeper models become increasingly advantageous as the context length grows. However, beyond a certain depth, the performance of SSMs begins to plateau and eventually declines.
  • Figure 5: Visualization of feature smoothness across layers in pre-trained Mamba and Pythia. The y-axis represents the average pairwise differences among tokens. Mixer outputs (b) solely consider the Mamba or attention module, while Block outputs (c) include all other components (e.g., MLP).
  • ...and 5 more figures
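
The smoothness measure named in the Figure 5 caption, the average pairwise difference among tokens, can be sketched as follows. The Euclidean norm, the unweighted mean over pairs, and the causal moving-average layer used to produce smoother features are assumptions made for illustration; the caption does not specify the exact norm or normalization, and `avg_pairwise_diff` and `causal_smooth` are hypothetical helper names.

```python
import math
from itertools import combinations


def avg_pairwise_diff(tokens):
    """Mean Euclidean distance over all unordered pairs of token vectors."""
    pairs = list(combinations(tokens, 2))
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


def causal_smooth(tokens, a=0.5):
    """One causal moving-average pass: h_t = a * h_{t-1} + (1 - a) * x_t.

    The state starts at the first token, so the first output equals the first input.
    """
    h = list(tokens[0])
    out = []
    for x in tokens:
        h = [a * hi + (1 - a) * xi for hi, xi in zip(h, x)]
        out.append(h)
    return out


# Four well-separated toy tokens in two dimensions.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

layer1 = causal_smooth(tokens)
layer2 = causal_smooth(layer1)

d0 = avg_pairwise_diff(tokens)
d1 = avg_pairwise_diff(layer1)
d2 = avg_pairwise_diff(layer2)
print(d0, d1, d2)
```

On this toy input the metric falls with each stacked smoothing pass, which is the qualitative pattern that over-smoothing refers to: token representations drift toward one another as depth grows.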

Theorems & Definitions (14)

  • Theorem 3.1: Recency of SSMs
  • Proposition 4.1: Low-pass filtering of continuous S4
  • Theorem 4.2: Over-smoothing of SSMs
  • Lemma D.1: Parallel form
  • proof
  • Theorem D.2: Recency of SSMs
  • proof: Proof of Theorem \ref{thm:local}
  • Definition D.3: Low-pass Filter
  • Proposition D.4: Formal version of Proposition \ref{prop:s4_low_pass}
  • proof
  • ...and 4 more