Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention

Arya Honarpisheh, Mustafa Bozdag, Octavia Camps, Mario Sznaier

TL;DR

This work tackles the generalization capability of selective state-space models (SSMs) used in Mamba by deriving a novel covering-number-based bound that translates Transformer-style analysis to nonlinear, input-conditioned SSMs. The core insight is that the spectral abscissa $s_{\boldsymbol{A}}$ of the continuous-time state matrix governs both training stability and length generalization, yielding a length-independent bound when $s_{\boldsymbol{A}}<0$ and exponential growth when $s_{\boldsymbol{A}}>0$. A two-tier covering argument connects the selective SSM dynamics to attention, enabling a Dudley-integral bound on the Rademacher complexity and a corollary for linear attention with bound scaling as $\tilde{O}(T)$. Empirical results on Majority, IMDb, and ListOps corroborate the theory, showing that training trajectories push $s_{\boldsymbol{A}}$ toward stability and that stable regimes generalize consistently across long sequences. Overall, the paper provides theoretical guarantees and practical guidance for stabilizing selective SSMs while preserving long-range dependencies.

Abstract

State-space models (SSMs) have recently emerged as a compelling alternative to Transformers for sequence modeling tasks. This paper presents a theoretical generalization analysis of selective SSMs, the core architectural component behind the Mamba model. We derive a novel covering number-based generalization bound for selective SSMs, building upon recent theoretical advances in the analysis of Transformer models. Using this result, we analyze how the spectral abscissa of the continuous-time state matrix influences the model's stability during training and its ability to generalize across sequence lengths. We empirically validate our findings on a synthetic majority task, the IMDb sentiment classification benchmark, and the ListOps task, demonstrating how our theoretical insights translate into practical model behavior.
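
To make the role of the spectral abscissa concrete, the following minimal sketch (not the authors' code) computes it as the largest real part of the eigenvalues of the state matrix and tracks the state norm of an unforced recurrence over a sequence. The diagonal state matrix, the fixed step size `delta`, the all-ones initial state, and the helper names `spectral_abscissa` and `state_norms` are assumptions made for illustration; in a selective SSM the step size depends on the input.

```python
import numpy as np


def spectral_abscissa(A):
    """Largest real part among the eigenvalues of a square matrix A."""
    return float(np.max(np.linalg.eigvals(A).real))


def state_norms(a_diag, delta, T):
    """Norms of h_t for h_{t+1} = exp(delta * a) * h_t (elementwise).

    Assumptions: diagonal continuous-time state matrix with entries a_diag,
    a fixed step size delta, no input term, and h_0 = ones.
    """
    a_bar = np.exp(delta * np.asarray(a_diag, dtype=float))
    h = np.ones_like(a_bar)
    norms = []
    for _ in range(T):
        h = a_bar * h
        norms.append(float(np.linalg.norm(h)))
    return norms


# Hypothetical diagonal entries: one stable and one unstable configuration.
stable = np.array([-0.1, -0.5])
unstable = np.array([0.1, -0.5])

print("s_A (stable):  ", spectral_abscissa(np.diag(stable)))
print("s_A (unstable):", spectral_abscissa(np.diag(unstable)))
print("final norm (stable):  ", state_norms(stable, 1.0, 200)[-1])
print("final norm (unstable):", state_norms(unstable, 1.0, 200)[-1])
```

Under these assumptions the state norm shrinks with the sequence length when the spectral abscissa is negative and grows exponentially when it is positive, which mirrors the two regimes described above.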

Paper Structure

This paper contains 26 sections, 23 theorems, 106 equations, 5 figures, 3 tables.

Key Result

Theorem 3.2

Given a real-valued function class $\mathcal{F} = \{ f:\mathcal{U} \rightarrow \mathbb{R} \}$ such that $\forall u \in \mathcal{U}, \; | f(u) | \leq \mathfrak{b}$ and a set of vectors $S = \{u_{(i)}\}_{i=1}^m$, we have

Figures (5)

  • Figure 1: Experiment 1. Top: Training loss vs. epochs for Left: Majority, Middle: IMDb, Right: ListOps. Bottom: Evolution of $s_{\bm{A}}$ vs. epochs for the same datasets. All runs use an unstable initialization with $s_{\bm{A}}=0.1$. Whenever training successfully reduces the loss, the spectral abscissa $s_{\bm{A}}$ is driven toward zero, indicating that the system becomes stable. In cases where $s_{\bm{A}}$ does not decrease toward zero, training is not successful.
  • Figure 2: Experiment 2. Left: Majority, Middle: IMDb, Right: ListOps. Train and test accuracy versus sequence length $T$ for models initialized with $s_{\bm{A}}=0$. The results demonstrate length-independent generalization. Each experiment is repeated five times with different random seeds; the dashed line denotes the mean accuracy across runs, and the shaded region represents $\pm$ one standard deviation.
  • Figure 3: Majority. Histogram of ones, $m=1000$ samples each for train and test, sequence length $T=200$.
  • Figure 4: IMDb. Histogram of sequence lengths for both the training and test splits.
  • Figure 5: Experiment 1. Top: Training loss vs. epochs for Left: Majority, $T=250$, Middle: IMDb, $T=500$, Right: ListOps, $T=300$. Bottom: Evolution of $s_{\bm{A}}$ vs. epochs for the same datasets. We sweep the $s_{\bm{A}}$ values from $-0.1$ to $0.1$ in $0.02$ increments.
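
The Majority data summarized in Figure 3 can be approximated with the short sketch below. It is an illustration only, not the authors' generation procedure: it assumes binary tokens drawn uniformly at random, a label of 1 when ones strictly outnumber zeros (ties count as 0), and a hypothetical helper name `make_majority_dataset`. The values $m=1000$ and $T=200$ are taken from the Figure 3 caption.

```python
import numpy as np


def make_majority_dataset(m, T, seed=0):
    """Draw m binary sequences of length T and label each by its majority bit.

    Assumptions: tokens are i.i.d. uniform over {0, 1}; the label is 1 only
    when ones strictly outnumber zeros, so ties are labeled 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(m, T))
    ones = x.sum(axis=1)
    y = (ones > T / 2).astype(int)
    return x, y, ones


x, y, ones = make_majority_dataset(m=1000, T=200)

# Histogram of the number of ones per sequence (one bin per possible count).
counts = np.bincount(ones, minlength=201)
print("sequences:", x.shape[0], "length:", x.shape[1])
print("fraction labeled 1:", y.mean())
```

Using a different seed for the test split gives a second, independent sample of the same size.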

Theorems & Definitions (47)

  • Definition 3.1: Covering number
  • Remark 3.1: Types of covering numbers
  • Theorem 3.2: Bartlett et al. (2017), Lemma A.5
  • Theorem 3.3
  • Proof: Proof sketch of Theorem 3.3
  • Remark 3.2: Lens of Attention
  • Proposition 3.4
  • Remark 4.1: LTI SSMs
  • Theorem 4.1
  • Lemma C.1
  • ...and 37 more
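
As a small illustration of the covering number named in Definition 3.1, the sketch below builds an $\epsilon$-cover of a finite point set greedily. It assumes the Euclidean norm and restricts centers to the points themselves, so the size of the returned cover is only an upper bound on the covering number, not its exact value. The helper name `greedy_cover` is hypothetical and does not come from the paper.

```python
import numpy as np


def greedy_cover(points, eps):
    """Return indices of centers so that every point lies within eps of one.

    Assumptions: Euclidean distance, centers chosen among the points. The
    greedy choice yields a valid eps-cover, not necessarily a minimal one.
    """
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        i = int(np.flatnonzero(uncovered)[0])  # first point not yet covered
        centers.append(i)
        dist = np.linalg.norm(points - points[i], axis=1)
        uncovered &= dist > eps  # everything within eps of i is now covered
    return centers


# Ten equally spaced points on a line, as a toy example.
pts = np.arange(10, dtype=float).reshape(-1, 1)
for eps in (0.5, 1.0, 100.0):
    print("eps =", eps, "cover size =", len(greedy_cover(pts, eps)))
```

Shrinking `eps` can only increase the size of the cover, which is the behavior that an entropy integral over $\epsilon$ accumulates.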