A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models
Mugunthan Shandirasegaran, Hongkang Li, Songyang Zhang, Meng Wang, Shuai Zhang
TL;DR
This work analyzes the training dynamics of a simplified Mamba block with input-dependent gating, deriving non-asymptotic sample complexity and gradient-descent convergence bounds that guarantee generalization under two structured data regimes. The gating vector is shown to align with class-relevant features while suppressing irrelevant ones, formalizing feature selection via selective recurrence. For majority-voting data, the results yield a sample complexity of $N\ge \Omega\left( \dfrac{L^2 d}{\eta^2(\alpha_r-\alpha_c)^2} \right)$ and an iteration count of $T=\Theta\left( \dfrac{L^2}{\eta(\alpha_r-\alpha_c)^2} \right)$; for locality-structured data, the bounds depend on $[(1/2)^{\Delta L_{\mathbf{o}_+}^+}-(1/2)^{\Delta L_{\mathbf{o}_+}^-}]$. Experiments on synthetic data corroborate the theory, showing gating-enhanced alignment and locality-driven convergence. Together, these results provide a principled counterpoint to Transformer-centric explanations.
Abstract
The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model in which tokens include both class-relevant and class-irrelevant patterns under token-level noise, and it examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data support our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.
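To make the feature-selection claim concrete, the following is a minimal NumPy sketch of a majority-voting sequence passed through an input-dependent gate. It is an illustration under stated assumptions, not the model analyzed in the paper: the orthonormal pattern construction, the additive recurrence $h_t = h_{t-1} + g_t x_t$ with $g_t = \sigma(w^\top x_t)$, the hand-set gating vector `w`, and all names and default values (`make_majority_sequence`, `gated_accumulate`, `n_relevant`, `n_confusing`, `noise`) are hypothetical choices made for this example. The gating vector is fixed by hand to mimic the aligned state that the analysis predicts; no training is performed.

```python
import numpy as np


def make_majority_sequence(rng, L=20, d=8, n_relevant=6, n_confusing=2, noise=0.05):
    """Toy majority-voting sequence (assumed construction for illustration).

    Pattern 0 votes for class +1, pattern 1 votes for class -1,
    and patterns 2..d-1 are class-irrelevant.
    """
    patterns = np.eye(d)  # orthonormal token patterns
    label = int(rng.choice([-1, 1]))
    rel, conf = (0, 1) if label == 1 else (1, 0)
    n_irrelevant = L - n_relevant - n_confusing
    ids = np.concatenate([
        np.full(n_relevant, rel),               # tokens voting for the true class
        np.full(n_confusing, conf),             # tokens voting for the other class
        rng.integers(2, d, size=n_irrelevant),  # class-irrelevant tokens
    ])
    rng.shuffle(ids)
    # Token-level Gaussian noise added to every token.
    X = patterns[ids] + noise * rng.standard_normal((L, d))
    return X, ids, label


def gated_accumulate(X, w):
    """Simplified selective recurrence: h_t = h_{t-1} + g_t * x_t, g_t = sigmoid(w . x_t)."""
    gates = 1.0 / (1.0 + np.exp(-(X @ w)))
    h = np.zeros(X.shape[1])
    for g, x in zip(gates, X):
        h = h + g * x  # a gate near 0 leaves the state unchanged
    return h, gates


rng = np.random.default_rng(0)
X, ids, label = make_majority_sequence(rng)

# Hand-set gating vector: positive on the two class patterns, negative elsewhere.
w = np.full(X.shape[1], -6.0)
w[:2] = 6.0

h, gates = gated_accumulate(X, w)
prediction = 1 if h[0] > h[1] else -1

print("label:", label, "prediction:", prediction)
print("mean gate on class tokens:     ", gates[ids < 2].mean())
print("mean gate on irrelevant tokens:", gates[ids >= 2].mean())
```

In this sketch the gate passes both class patterns and suppresses the irrelevant ones, so the accumulated state is dominated by the class tokens and the majority among them determines the prediction.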
