State Space Models are Provably Comparable to Transformers in Dynamic Token Selection
Naoki Nishikawa, Taiji Suzuki
TL;DR
This work investigates whether state space models (SSMs), combined with nonlinear layers, can match Transformers at dynamic token selection in sequence modeling. The authors formulate a deep SSM-FNN architecture with embedding, convolution, and feedforward components, and prove that it can select tokens dynamically on the input copying and associative recall tasks. They also prove competitive nonparametric regression performance for piecewise $\gamma$-smooth function classes. Specifically, they show that two-layer SSMs with pre- and post-FNNs can mimic attention to within error $\epsilon$ on these tasks with poly-logarithmic parameter growth, and they establish estimation rates matching those of Transformer-based approaches up to poly-log factors. Empirically, experiments on genomic data reveal sparse important-token distributions and input-dependent focusing behavior, supporting the practicality of SSM-based architectures as efficient Transformer alternatives for long-range sequence modeling.
Abstract
Deep neural networks based on state space models (SSMs) are attracting significant attention in sequence modeling since their computational cost is much smaller than that of Transformers. While the capabilities of SSMs have been demonstrated through experiments in various tasks, theoretical understanding of SSMs is still limited. In particular, most theoretical studies discuss the capabilities of SSM layers without nonlinear layers, and there is a lack of discussion on their combination with nonlinear layers. In this paper, we explore the capabilities of SSMs combined with fully connected neural networks, and show that they are comparable to Transformers in extracting the essential tokens depending on the input. As concrete examples, we consider two synthetic tasks, which are challenging for a single SSM layer, and demonstrate that SSMs combined with nonlinear layers can efficiently solve these tasks. Furthermore, we study the nonparametric regression task, and prove that the ability of SSMs is equivalent to that of Transformers in estimating functions belonging to a certain class.
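To make "SSMs combined with fully connected neural networks" concrete, the following is a minimal NumPy sketch of one such block. It is not the paper's construction: the diagonal state matrix `a`, the scalar input channel, and all function and parameter names are assumptions chosen for illustration. It shows that a linear SSM layer can be computed either as a recurrence or as a causal convolution with a kernel that does not depend on the input, after which a token-wise FNN supplies the nonlinearity.

```python
import numpy as np

def ssm_kernel(a, b, c, length):
    """Convolution kernel k[j] = sum_n c[n] * a[n]**j * b[n] of a diagonal linear SSM."""
    powers = a[None, :] ** np.arange(length)[:, None]  # shape (length, state_dim)
    return powers @ (b * c)

def ssm_recurrent(x, a, b, c):
    """Scan form: h_t = a * h_{t-1} + b * x_t, y_t = <c, h_t>."""
    h = np.zeros_like(a)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t
        y[t] = c @ h
    return y

def ssm_convolution(x, a, b, c):
    """Same map written as a causal convolution y_t = sum_{j<=t} k[j] * x_{t-j}."""
    k = ssm_kernel(a, b, c, len(x))
    return np.convolve(x, k)[: len(x)]

def fnn(y, w1, b1, w2, b2):
    """Token-wise two-layer ReLU network, applied to each position independently."""
    hidden = np.maximum(0.0, y[:, None] * w1[None, :] + b1[None, :])
    return hidden @ w2 + b2

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
state_dim, seq_len, width = 4, 16, 8
a = rng.uniform(0.1, 0.9, state_dim)   # stable diagonal state matrix (assumption)
b = rng.normal(size=state_dim)
c = rng.normal(size=state_dim)
x = rng.normal(size=seq_len)

y_scan = ssm_recurrent(x, a, b, c)
y_conv = ssm_convolution(x, a, b, c)
out = fnn(y_conv, rng.normal(size=width), rng.normal(size=width),
          rng.normal(size=width), 0.0)
```

In this sketch the kernel returned by `ssm_kernel` is fixed once `a`, `b`, and `c` are chosen, so any input-dependent behavior of the block has to come from the nonlinear layers around the convolution.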
