Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention
Arya Honarpisheh, Mustafa Bozdag, Octavia Camps, Mario Sznaier
TL;DR
This work studies the generalization of selective state-space models (SSMs), the core component of Mamba, by deriving a novel covering-number-based bound that carries Transformer-style analysis over to nonlinear, input-conditioned SSMs. The core insight is that the spectral abscissa $s_{\boldsymbol{A}}$ of the continuous-time state matrix governs both training stability and length generalization: the bound is independent of sequence length when $s_{\boldsymbol{A}}<0$ and grows exponentially with sequence length when $s_{\boldsymbol{A}}>0$. A two-tier covering argument connects the selective SSM dynamics to attention, enabling a Dudley-integral bound on the Rademacher complexity and a corollary for linear attention whose bound scales as $\tilde{O}(T)$. Experiments on Majority, IMDb, and ListOps corroborate the theory, showing that training pushes $s_{\boldsymbol{A}}$ toward the stable regime and that stable models generalize consistently across long sequences. Overall, the paper provides theoretical guarantees and practical guidance for stabilizing selective SSMs while preserving long-range dependencies.
Abstract
State-space models (SSMs) have recently emerged as a compelling alternative to Transformers for sequence modeling tasks. This paper presents a theoretical generalization analysis of selective SSMs, the core architectural component behind the Mamba model. We derive a novel covering number-based generalization bound for selective SSMs, building upon recent theoretical advances in the analysis of Transformer models. Using this result, we analyze how the spectral abscissa of the continuous-time state matrix influences the model's stability during training and its ability to generalize across sequence lengths. We empirically validate our findings on a synthetic majority task, the IMDb sentiment classification benchmark, and the ListOps task, demonstrating how our theoretical insights translate into practical model behavior.
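To make the role of the spectral abscissa concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard definition of the spectral abscissa as the largest real part among the eigenvalues of the state matrix, a diagonal state matrix, a fixed step size `delta` (a selective SSM would make it input-dependent), and a constant unit input. The function names `spectral_abscissa` and `state_norms` are hypothetical.

```python
import numpy as np


def spectral_abscissa(A):
    """Largest real part among the eigenvalues of the square matrix A."""
    return float(np.max(np.linalg.eigvals(A).real))


def state_norms(a_diag, delta, T):
    """Norms of h_t for h_t = exp(delta * a) * h_{t-1} + delta * u_t.

    Assumptions (illustrative only): diagonal state matrix with entries
    a_diag, fixed step size delta, constant input u_t = 1, and h_0 = 0.
    """
    a_diag = np.asarray(a_diag, dtype=float)
    decay = np.exp(delta * a_diag)  # discretized diagonal state transition
    h = np.zeros_like(a_diag)
    norms = []
    for _ in range(T):
        h = decay * h + delta
        norms.append(float(np.linalg.norm(h)))
    return np.array(norms)


if __name__ == "__main__":
    stable = np.array([-1.0, -0.5])    # spectral abscissa below zero
    unstable = np.array([-1.0, 0.5])   # spectral abscissa above zero
    for name, a in [("stable", stable), ("unstable", unstable)]:
        s = spectral_abscissa(np.diag(a))
        norms = state_norms(a, delta=0.1, T=400)
        print(name, "s_A =", s, "norm at T=100:", norms[99], "norm at T=400:", norms[399])
```

In this toy setting the state norm levels off as the sequence gets longer when the spectral abscissa is negative and keeps growing when it is positive, which mirrors the length-independent versus exponentially growing behavior described above.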
