Demystifying the Token Dynamics of Deep Selective State Space Models

Thieu N Vo, Tung D. Pham, Xin T. Tong, Tan Minh Nguyen

TL;DR

This work tackles the theoretical understanding of token dynamics in deep selective state-space models, notably Mamba, by deriving a continuous-time limit and analyzing the asymptotics in the one-dimensional setting. The authors classify token behavior into convergence and two divergence regimes based on the input-output matrix $\mu = S_C^{\top}S_B$ and the step-size function $\Delta$, providing explicit rates such as $O(1/\sqrt{t})$ for convergence and $x_l(t)=O((\ln t)^l)$ for slow divergence, with finite-time blow-up in fast divergence. They show that convergence harms model performance, while divergence leads to unequal token contributions during training, motivating refinements: excluding the convergent regime and reordering tokens by estimated importance, which they validate on ImageNet and WikiText103. Overall, the results offer principled insights for improving the reliability and efficiency of Mamba-like models in real-world tasks by linking dynamical properties to practical training outcomes.
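To make the three asymptotic behaviors named above concrete, here is a minimal sketch using toy scalar ODEs. These are not the paper's dynamical system, which this summary does not reproduce. The functions `converge`, `slow`, and `fast`, the integrator `rk4`, and all numeric settings are assumptions chosen only because their closed-form solutions show an $O(1/\sqrt{t})$ decay, a $\ln t$ growth (the $l=1$ case of $(\ln t)^l$), and a finite-time blow-up.

```python
import math

def rk4(f, x0, t_end, steps):
    """Integrate the scalar ODE dx/dt = f(x) from t=0 to t_end with classical RK4."""
    h = t_end / steps
    x = x0
    for _ in range(steps):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x

# Toy ODEs (illustrative assumptions, NOT the Mamba token dynamics).
def converge(x):
    # Closed form: x(t) = x0 / sqrt(1 + 2*x0^2*t), which decays like 1/sqrt(t).
    return -x ** 3

def slow(x):
    # Closed form: x(t) = ln(t + exp(x0)), which grows like ln(t).
    return math.exp(-x)

def fast(x):
    # Closed form: x(t) = x0 / (1 - x0*t), which blows up at t = 1/x0.
    return x ** 2

x0 = 1.0
x_conv = rk4(converge, x0, 100.0, 20000)
x_slow = rk4(slow, x0, 100.0, 20000)
x_fast = rk4(fast, x0, 0.99, 20000)  # stop just before the blow-up time t = 1

print("convergence:     numeric", x_conv, "closed form", x0 / math.sqrt(1 + 2 * x0**2 * 100.0))
print("slow divergence: numeric", x_slow, "closed form", math.log(100.0 + math.exp(x0)))
print("fast divergence: numeric", x_fast, "closed form", x0 / (1 - x0 * 0.99))
```

The fast case is integrated only up to $t=0.99$ because the toy solution ceases to exist at $t=1$. This is the same qualitative obstruction the summary describes as finite-time blow-up.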

Abstract

Selective state space models (SSM), such as Mamba, have gained prominence for their effectiveness in modeling sequential data. Despite their outstanding empirical performance, a comprehensive theoretical understanding of deep selective SSM remains elusive, hindering their further development and adoption for applications that need high fidelity. In this paper, we investigate the dynamical properties of tokens in a pre-trained Mamba model. In particular, we derive the dynamical system governing the continuous-time limit of the Mamba model and characterize the asymptotic behavior of its solutions. In the one-dimensional case, we prove that only one of the following two scenarios happens: either all tokens converge to zero, or all tokens diverge to infinity. We provide criteria based on model parameters to determine when each scenario occurs. For the convergent scenario, we empirically verify that this scenario negatively impacts the model's performance. For the divergent scenario, we prove that different tokens will diverge to infinity at different rates, thereby contributing unequally to the updates during model training. Based on these investigations, we propose two refinements for the model: excluding the convergent scenario and reordering tokens based on their importance scores, both aimed at improving practical performance. Our experimental results validate these refinements, offering insights into enhancing Mamba's effectiveness in real-world applications.

Paper Structure

This paper contains 34 sections, 12 theorems, 56 equations, 8 figures, 10 tables, and 1 algorithm.

Key Result

Theorem 4.1

Assume that $\mu=S_C^{\top}S_B<0$. Let $\mathbf{x}(t)=(x_1(t),\ldots,x_L(t))$ be the unique solution of the one-dimensional ($d=1$) Mamba dynamical system (the paper's equation labeled eq:mamba_dynamic_d=1_mainbody).

Figures (8)

  • Figure 1: The graphs of tokens (left) and the heat map of the hidden attention matrices (right) for the convergence scenario with $L=10$ tokens and parameters $\mu=-1.58$, $S_{\Delta}=-0.17$, $a=-1.08$, as well as initial data $\mathbf{x}(0)=(-1.79, -0.34, 0.46, -1.25, 0.83, -0.83, 1.81, -1.16, 0.13, -0.19)$. In this case, all tokens and hidden attention scores tend to zero as $t$ approaches infinity.
  • Figure 2: The graphs of tokens (left) and the heat map of the hidden attention matrices (right) for the slow divergence scenario with $L=10$ tokens and the model parameters $\mu=1.79$, $S_{\Delta}=-0.71$, $a=-1.80$, as well as the initial data $\mathbf{x}(0)=(1.55, 2.84, 3.81, 4.57, 5.99, 6.94, 7.71, 8.96, 9.59, 10.75)$. In this case, tokens tend to infinity at a logarithmic rate, and a token with a larger initial value diverges faster, while the hidden attention scores tend to zero as $t$ approaches infinity. In addition, the hidden attention scores in the second column tend to zero much faster than those in the first column.
  • Figure 3: The graphs of tokens (left) and the heat map of the hidden attention matrices (right) for the fast divergence scenario with $L=10$ tokens and the model parameters $\mu=0.76$, $S_{\Delta}=0.59$, $a=1.66$, as well as the initial data $\mathbf{x}(0)=(0.83, 0.91, 0.64, 0.78, 0.66, 0.99, 0.68, 0.72, 0.61, 0.90)$. In this case, tokens and the hidden attention scores blow up to infinity in finite time.
  • Figure 4: Test perplexity on WikiText103 during training. The positive case consistently outperforms the other two scenarios.
  • Figure 5: Top-1 (left) and Top-5 (right) accuracy ($\uparrow$) on ImageNet-1K during the training process. Our token reordering method boosts the accuracy of the MambaVision baseline and achieves faster convergence compared to the baseline.
  • ...and 3 more figures

Theorems & Definitions (28)

  • Remark 1: Convergence vs. divergence scenarios and their impact
  • Remark 2: Unequal contribution of tokens during training
  • Remark 3: Higher dimension
  • Remark 4: Comparison with token dynamics in Transformers
  • Theorem 4.1: Convergence scenario
  • Remark 5: Negative impact on model's performance
  • Lemma 4.2: Slow divergence scenario
  • Theorem 4.3: Slow divergence scenario and divergence rate
  • Remark 6: Unequal token contributions
  • Theorem 4.4: Fast divergence scenario
  • ...and 18 more