State Space Models are Provably Comparable to Transformers in Dynamic Token Selection

Naoki Nishikawa; Taiji Suzuki

TL;DR

This work investigates whether state space models (SSMs), when combined with nonlinear layers, can achieve dynamic token selection comparable to Transformers in sequence modeling. The authors formulate a deep SSM-FNN architecture with embedding, convolution, and feedforward components, and prove that it performs dynamic token selection on the input copying and associative recall tasks and attains competitive nonparametric regression performance for piecewise $\gamma$-smooth function classes. They show that two-layer SSMs with pre- and post-FNNs can mimic attention with poly-logarithmic parameter growth, achieving error $\epsilon$ on these tasks, and establish estimation rates matching Transformer-based approaches up to poly-log factors. Empirically, experiments on genomic data reveal sparse important-token distributions and input-dependent focusing behavior, supporting the practicality of SSM-based architectures as efficient Transformer alternatives for long-range sequence modeling.

Abstract

Deep neural networks based on state space models (SSMs) are attracting significant attention in sequence modeling since their computational cost is much smaller than that of Transformers. While the capabilities of SSMs have been demonstrated through experiments in various tasks, theoretical understanding of SSMs is still limited. In particular, most theoretical studies discuss the capabilities of SSM layers without nonlinear layers, and there is a lack of discussion on their combination with nonlinear layers. In this paper, we explore the capabilities of SSMs combined with fully connected neural networks, and show that they are comparable to Transformers in extracting the essential tokens depending on the input. As concrete examples, we consider two synthetic tasks, which are challenging for a single SSM layer, and demonstrate that SSMs combined with nonlinear layers can efficiently solve these tasks. Furthermore, we study the nonparametric regression task, and prove that the ability of SSMs is equivalent to that of Transformers in estimating functions belonging to a certain class.

Paper Structure (47 sections, 25 theorems, 236 equations, 6 figures)

Introduction
Other related works.
Notations.
The Definition of Deep Neural Networks with SSMs
(i) FNN layer
(ii) Convolution layer
(iii) Embedding layer
Synthetic Tasks: Input Copying and Associative Recall
Input Copying
Associative Recall
SSMs Mimic Attention Mechanisms to Select Important Tokens
Nonparametric Regression Problem
Problem setting
Piecewise $\gamma$-smooth function class
$\gamma$-smooth function class.
...and 32 more sections

Key Result

Theorem 3.1

For any $\epsilon>0$, there exist an SSM $\hat{F}\in\mathcal{S}(M, U, D, L, W, S, B)$ and a decoding layer $\mathrm{Dec}$ with $\|W_\mathrm{Dec}\|_\infty\leq 1$ such that $\sup_{V'\in[V]}~\mathrm{err}_{V'}\leq\epsilon.$

Figures (6)

Figure 1.1: Conceptual illustrations of our theory. The abilities of SSMs are said to be limited since their filter is not data-dependent. However, when combined with nonlinear layers, SSMs are comparable to Transformers in terms of dynamic token selection. Indeed, experiments on associative recall tasks show that SSMs capture the important tokens in the sequence depending on the input, which is similar to the behavior of Transformers. The heatmap in the figure represents the importance of the token when the model predicts the output. Note that these are not artificial figures, but the actual results of the experiments.
Figure 5.1: The transition of the probability of correct classification when we repeatedly mask the input tokens.
Figure C.1: Empirical results for input copying task (left) and associative recall task (right). We compare the performance of single-layer SSMs (SSM + FNN), two-layer SSMs (SSM + FNN + SSM + FNN), and Transformers. The number in parentheses following "FNN" indicates the depth of the FNN. We can see that two-layer SSMs with sufficiently expressive FNN layers exhibit performance comparable to Transformers, and outperform single-layer SSMs.
Figure C.2: Empirical results for nonparametric regression. We can see that two-layer SSMs (with FNNs) perform better than one-layer SSMs, similar to input copying and associative recall.
Figure E.1: Intuitive explanation of piecewise $\gamma$-smooth functions. Left: For simplicity, consider a finite-length input sequence $X = [x_{-4}, \ldots, x_{-1}, x_0]$. An importance function $\mu$ takes the sequence as input and determines the importance of the last token. Using the function $\mu$, the importance values of each token, $\mu(X_{-4}), \ldots, \mu(X_0)$, are determined. A permutation map $\Pi$ rearranges the tokens in ascending order of their importance. Finally, the rearranged tokens are fed into a $\gamma$-smooth function $f$. In the sorted sequence, tokens on the right have higher importance, and the function $f$ becomes less smooth for tokens positioned further to the right. Right: An intuitive explanation of how the smoothness of a function changes due to token reordering. As an example, consider a function with a 3-dimensional input vector $X = (x_1, x_2, x_3)$. Assume $f$ is non-smooth only in the direction of the second coordinate and smooth in all other directions. If $X$ is fed directly into $f$, the second coordinate, $x_2$, is always the non-smooth direction. On the other hand, if the coordinates are rearranged by an input-dependent permutation map $\Pi$ before being passed to $f$, the smoothness of the function changes. For example, in the top-left region of the domain, the reordering might cause the second coordinate to correspond to $x_3$, making $x_3$ the non-smooth direction.
...and 1 more figure
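
The pipeline described in the caption of Figure E.1 (score each token with $\mu$, sort by importance with $\Pi$, then apply $f$) can be illustrated with a minimal Python sketch. The concrete choices here are assumptions made only for illustration and are not taken from the paper: `mu` scores a prefix by the absolute value of its last token, and `f` is a toy weighted sum that is simply more sensitive to later (more important) positions, which is only a crude proxy for reduced smoothness.

```python
import numpy as np


def mu(prefix):
    """Hypothetical importance function: scores the last token of `prefix`."""
    return abs(float(prefix[-1]))


def reorder_by_importance(x, importance=mu):
    """Sort tokens in ascending order of importance (the role of Pi)."""
    # One score per token, computed from the prefix that ends at that token.
    scores = np.array([importance(x[: i + 1]) for i in range(len(x))])
    order = np.argsort(scores, kind="stable")
    return x[order], order


def f(sorted_tokens):
    """Toy stand-in for f: later positions receive larger weights."""
    weights = np.arange(1, len(sorted_tokens) + 1)
    return float(np.sum(weights * np.sin(sorted_tokens)))


x = np.array([0.5, -2.0, 0.1, 1.0, -0.3])
sorted_x, order = reorder_by_importance(x)
y = f(sorted_x)
```

Because `order` is computed from the input itself, a different sequence generally yields a different permutation, which is the input-dependent behavior the caption describes.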

Theorems & Definitions (43)

Theorem 3.1
Theorem 3.2
Lemma 3.3: Dynamic Token Selection by SSMs
Definition 4.1: $\gamma$-smooth function class
Definition 4.2: Piecewise $\gamma$-smooth function class
Remark 4.4
Theorem 4.5
Theorem 4.6
Theorem D.1
Theorem D.2
...and 33 more
