
A Theoretical Analysis of Mamba's Training Dynamics: Filtering Relevant Features for Generalization in State Space Models

Mugunthan Shandirasegaran, Hongkang Li, Songyang Zhang, Meng Wang, Shuai Zhang

TL;DR

This work analyzes the training dynamics of a simplified Mamba block with input-dependent gating and derives non-asymptotic sample complexity and gradient-descent convergence bounds that guarantee generalization under two structured data regimes. The gating vector is shown to align with class-relevant features while suppressing irrelevant ones, which formalizes feature selection via selective recurrence. For majority-voting data, the required sample complexity is $N\ge \Omega\left( \dfrac{L^2 d}{\eta^2(\alpha_r-\alpha_c)^2} \right)$ and the number of iterations is $T=\Theta\left( \dfrac{L^2}{\eta(\alpha_r-\alpha_c)^2} \right)$; for locality-structured data, the bounds depend on $[(1/2)^{\Delta L_{\mathbf{o}_+}^+}-(1/2)^{\Delta L_{\mathbf{o}_+}^-}]$. Experiments on synthetic data corroborate the theory, showing gating-enhanced alignment and locality-driven convergence, and the analysis offers a principled counterpoint to Transformer-centric explanations.
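To make the majority-voting scaling concrete, the sketch below evaluates only the expressions inside $\Omega(\cdot)$ and $\Theta(\cdot)$. It is an illustration under assumptions: this summary does not define the symbols, so the code treats $L$, $d$, $\eta$, $\alpha_r$ and $\alpha_c$ as plain positive numbers, drops the hidden constants, and uses hypothetical function names and example values.

```python
def sample_complexity_scale(L, d, eta, alpha_r, alpha_c):
    """Expression inside Omega(.) for N: L^2 d / (eta^2 (alpha_r - alpha_c)^2).

    Constants hidden by the Omega notation are dropped, so only ratios
    between two settings are meaningful.
    """
    gap = alpha_r - alpha_c
    if gap <= 0:
        raise ValueError("the expression assumes alpha_r > alpha_c")
    return (L ** 2 * d) / (eta ** 2 * gap ** 2)


def iteration_scale(L, eta, alpha_r, alpha_c):
    """Expression inside Theta(.) for T: L^2 / (eta (alpha_r - alpha_c)^2)."""
    gap = alpha_r - alpha_c
    if gap <= 0:
        raise ValueError("the expression assumes alpha_r > alpha_c")
    return (L ** 2) / (eta * gap ** 2)


# Hypothetical values, chosen only to compare two gaps (0.4 versus 0.2).
wide = dict(L=16, eta=0.1, alpha_r=0.6, alpha_c=0.2)
narrow = dict(L=16, eta=0.1, alpha_r=0.4, alpha_c=0.2)

n_ratio = sample_complexity_scale(d=32, **narrow) / sample_complexity_scale(d=32, **wide)
t_ratio = iteration_scale(**narrow) / iteration_scale(**wide)
print(f"halving the gap scales N by {n_ratio:.2f} and T by {t_ratio:.2f}")
```

Because both expressions depend on the gap through $(\alpha_r-\alpha_c)^{-2}$, halving the gap multiplies each by roughly four, while doubling $L$ does the same through the $L^2$ factor.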

Abstract

The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data support our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.
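The architecture described above can be pictured with a small NumPy sketch. This is not the paper's exact parameterization, which this summary does not reproduce. As assumptions for illustration, the gate is a scalar sigmoid of $\bm{w}_\Delta^\top \bm{x}_t$, the state update is a convex combination of the previous state and the current token, and the readout is a ReLU layer $\bm{W}_O$ followed by fixed output weights `a`. The names `selective_scan`, `forward` and `a`, and all sizes, are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def selective_scan(X, w_delta):
    """Run an input-gated recurrence over tokens X of shape (L, d).

    Assumed update (illustrative only): h_t = (1 - g_t) * h_{t-1} + g_t * x_t,
    with scalar gate g_t = sigmoid(w_delta . x_t).
    """
    h = np.zeros(X.shape[1])
    for x_t in X:
        g_t = sigmoid(w_delta @ x_t)  # gate depends on the current token
        h = (1.0 - g_t) * h + g_t * x_t
    return h


def forward(X, w_delta, W_O, a):
    """Two-layer readout on the final state: a . ReLU(W_O h_L)."""
    h = selective_scan(X, w_delta)
    return float(a @ np.maximum(W_O @ h, 0.0))


rng = np.random.default_rng(0)
L, d, m, xi = 8, 4, 16, 0.1  # hypothetical sizes and init scale

X = rng.normal(size=(L, d))               # one synthetic token sequence
w_delta = np.zeros(d)                     # gating vector initialized at zero
W_O = rng.normal(0.0, xi, size=(m, d))    # entries drawn i.i.d. from N(0, xi^2)
a = rng.choice([-1.0, 1.0], size=m) / m   # fixed second-layer weights (assumption)

gates = sigmoid(X @ w_delta)
y_hat = forward(X, w_delta, W_O, a)
print("gates at initialization:", gates)
print("model output:", y_hat)
```

With $\bm{w}_\Delta = 0$ every gate equals $1/2$, so in this sketch the final state is a geometrically weighted sum of the tokens with weights $(1/2)^{L-t+1}$. Training would then move $\bm{w}_\Delta$ so that tokens carrying class-relevant patterns receive larger gates than irrelevant ones.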

Paper Structure (41 sections, 16 theorems, 418 equations, 18 figures, 3 tables)


Key Result

Lemma 4.1

Suppose each entry of $\bm{W}_O$ is initialized independently from $\mathcal{N}(0, \xi^2)$ and $\bm{w}_\Delta^{(0)} = 0$. Then, with a sufficient number of training samples and iterations, we have

Figures (18)

  • Figure 1: Convergence vs. majority-voting gap.
  • Figure 2: Alignment of $\bm{w}_\Delta$ for majority-voting data.
  • Figure 3: Convergence under locality-structured data.
  • Figure 4: Alignment of $\bm{w}_\Delta$ for locality-structured data.
  • Figure 5: Average alignment of $\bm{W}_{O(i,\cdot)}$ during training.
  • ...and 13 more figures

Theorems & Definitions (26)

  • Lemma 4.1: Gating Vector Alignment for Majority Voting Data
  • Theorem 1: Generalization for Majority Voting Data
  • Lemma 4.2: Gating Vector Alignment for Locality-structured Data
  • Theorem 2: Generalization for Locality-structured Data
  • Lemma B.1
  • Lemma B.2
  • Lemma B.3
  • Lemma B.4
  • Lemma B.5
  • Lemma B.6
  • ...and 16 more