On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach
Edo Cohen-Karlik, Itamar Zimerman, Liane Galanti, Ido Atad, Amir Globerson, Lior Wolf
TL;DR
This work introduces a multivariate polynomial framework to analyze the expressivity of selective state-space layers (S6) in the Mamba architecture for long-sequence modeling. It proves that S6 layers can express high-degree multivariate polynomials more efficiently than linear self-attention, requiring far fewer layers to capture complex interactions, and provides a length-agnostic generalization bound demonstrating that increased expressivity does not necessarily harm generalization. Theoretical results are supported by experiments on NLP and vision benchmarks and synthetic polynomial learning tasks, showing that a four-layer Mamba can represent broad polynomial classes while maintaining generalization. The findings illuminate fundamental differences between S6 and attention mechanisms, with practical implications for long-range sequence modeling and architecture design. The paper also discusses simplifications that preserve core behavior and their empirical validity, while acknowledging limitations and directions for extending the theory to full models.
Abstract
Recent advances in efficient sequence modeling have introduced selective state-space layers, a key component of the Mamba architecture, which have demonstrated remarkable success in a wide range of NLP and vision tasks. While Mamba's empirical performance has matched or surpassed SoTA transformers on such diverse benchmarks, the theoretical foundations underlying its powerful representational capabilities remain less explored. In this work, we investigate the expressivity of selective state-space layers using multivariate polynomials, and prove that they surpass linear transformers in expressiveness. Consequently, our findings reveal that Mamba offers superior representational power over linear attention-based models for long sequences, without sacrificing generalization. Our theoretical insights are validated by a comprehensive set of empirical experiments on various datasets.
