A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models

Mugunthan Shandirasegaran; Hongkang Li; Songyang Zhang; Meng Wang; Shuai Zhang

TL;DR

This work analyzes the training dynamics of a simplified Mamba block with input-dependent gating, deriving non-asymptotic sample complexity and gradient-descent convergence bounds for guaranteed generalization under two structured data regimes. The gating vector is shown to align with class-relevant features while suppressing irrelevant ones, formalizing feature selection via selective recurrence. For majority-voting data, the results yield sample complexity $N\ge \Omega\left( \dfrac{L^2 d}{\eta^2(\alpha_r-\alpha_c)^2} \right)$ and iterations $T=\Theta\left( \dfrac{L^2}{\eta(\alpha_r-\alpha_c)^2} \right)$; for locality-structured data, the bounds depend on $[(1/2)^{\Delta L_{\mathbf{o}_+}^+}-(1/2)^{\Delta L_{\mathbf{o}_+}^-}]$. Experiments on synthetic data corroborate the theory, showing gating-enhanced alignment and locality-driven convergence, and the analysis offers a principled counterpoint to Transformer-centric explanations.
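
As a worked reading of the majority-voting bounds above, the following sketch holds $L$, $d$, and $\eta$ fixed and doubles the gap $\alpha_r-\alpha_c$. The shorthand $\gamma = \alpha_r-\alpha_c$ is introduced here for illustration only, and the constants hidden by $\Omega(\cdot)$ and $\Theta(\cdot)$ are ignored, so only the scaling is meaningful:

$$
\frac{L^2 d / \bigl(\eta^2 (2\gamma)^2\bigr)}{L^2 d / \bigl(\eta^2 \gamma^2\bigr)} = \frac{1}{4},
\qquad
\frac{L^2 / \bigl(\eta (2\gamma)^2\bigr)}{L^2 / \bigl(\eta \gamma^2\bigr)} = \frac{1}{4}.
$$

Under these bounds, doubling the gap cuts both the required number of samples and the number of iterations by a factor of four, while doubling $L$ at a fixed gap multiplies both by four.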

Abstract

The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data support our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.
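
The feature-selection role of the gate can be illustrated with a toy recurrence. The sketch below is not the paper's exact parameterization, which this summary does not reproduce. It assumes a hypothetical update $h_t = (1-g_t)\,h_{t-1} + g_t\,x_t$ with gate $g_t = \sigma(\bm{w}_\Delta^\top x_t)$, one-hot "relevant" and "irrelevant" token patterns, and no noise or MLP head. The function name `gated_scan` and all numeric values are chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_scan(X, w_delta):
    """Assumed toy recurrence: h_t = (1 - g_t) * h_{t-1} + g_t * x_t,
    with input-dependent gate g_t = sigmoid(w_delta @ x_t)."""
    h = np.zeros(X.shape[1])
    for x in X:
        g = sigmoid(w_delta @ x)
        h = (1.0 - g) * h + g * x
    return h


d = 4
relevant = np.eye(d)[0]    # hypothetical class-relevant pattern
irrelevant = np.eye(d)[1]  # hypothetical class-irrelevant pattern

# Relevant tokens appear early, irrelevant tokens late in the sequence.
X = np.stack([relevant, relevant, irrelevant, irrelevant, irrelevant])

# Zero gating vector: every gate equals 1/2, so older tokens decay geometrically.
h_zero = gated_scan(X, np.zeros(d))

# Gating vector aligned with the relevant pattern and against the irrelevant one.
h_aligned = gated_scan(X, 4.0 * relevant - 4.0 * irrelevant)

print("zero gate   :", h_zero)
print("aligned gate:", h_aligned)
```

With the zero gating vector, the final state is dominated by the late irrelevant tokens; with the aligned gating vector, the gate nearly closes on irrelevant tokens and the state retains the relevant pattern. This mirrors, in a very reduced setting, the alignment behavior the abstract describes.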

Paper Structure (41 sections, 16 theorems, 418 equations, 18 figures, 3 tables)

Introduction
Related Work
Preliminaries
Problem Formulation
Theoretical Results
Key Takeaways and Insights of the Findings
Data Model
Formal Theoretical Results
Theoretical Results for Majority-Voting Data
Theoretical Results for Locality-Structured Data
Technical Novelty and Challenges
Numerical Experiments
Conclusion
Notations, Proof Sketch and Additional Experiments
Notations
...and 26 more sections

Key Result

Lemma 4.1

Suppose each entry of $\bm{W}_O$ is initialized independently from $\mathcal{N}(0, \xi^2)$ and $\bm{w}_\Delta^{(0)} = 0$. Then, with a sufficient number of training samples and iterations, we have

Figures (18)

Figure 1: Convergence vs. majority-voting gap.
Figure 2: Alignment of $\bm{w}_\Delta$ for majority-voting data.
Figure 3: Convergence under locality-structured data.
Figure 4: Alignment of $\bm{w}_\Delta$ for locality-structured data.
Figure 5: Average alignment of $\bm{W}_{O(i,\cdot)}$ during training.
...and 13 more figures

Theorems & Definitions (26)

Lemma 4.1: Gating Vector Alignment for Majority Voting Data
Theorem 1: Generalization for Majority Voting Data
Lemma 4.2: Gating Vector Alignment for Locality-structured Data
Theorem 2: Generalization for Locality-structured Data
Lemma B.1
Lemma B.2
Lemma B.3
Lemma B.4
Lemma B.5
Lemma B.6
...and 16 more
