On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach

Edo Cohen-Karlik, Itamar Zimerman, Liane Galanti, Ido Atad, Amir Globerson, Lior Wolf

TL;DR

This work introduces a multivariate polynomial framework to analyze the expressivity of selective state-space layers (S6) in the Mamba architecture for long-sequence modeling. It proves that S6 layers can express high-degree multivariate polynomials more efficiently than linear self-attention, requiring far fewer layers to capture complex interactions, and provides a length-agnostic generalization bound demonstrating that increased expressivity does not necessarily harm generalization. Theoretical results are supported by experiments on NLP and vision benchmarks and synthetic polynomial learning tasks, showing that a four-layer Mamba can represent broad polynomial classes while maintaining generalization. The findings illuminate fundamental differences between S6 and attention mechanisms, with practical implications for long-range sequence modeling and architecture design. The paper also discusses simplifications that preserve core behavior and their empirical validity, while acknowledging limitations and directions for extending the theory to full models.

Abstract

Recent advances in efficient sequence modeling have introduced selective state-space layers, a key component of the Mamba architecture, which have demonstrated remarkable success in a wide range of NLP and vision tasks. While Mamba's empirical performance has matched or surpassed SoTA transformers on such diverse benchmarks, the theoretical foundations underlying its powerful representational capabilities remain less explored. In this work, we investigate the expressivity of selective state-space layers using multivariate polynomials, and prove that they surpass linear transformers in expressiveness. Consequently, our findings reveal that Mamba offers superior representational power over linear attention-based models for long sequences, while not sacrificing their generalization. Our theoretical insights are validated by a comprehensive set of empirical experiments on various datasets.

Paper Structure

This paper contains 21 sections, 14 theorems, 84 equations, 3 figures, 4 tables.

Key Result

Theorem 1

(informal) Consider an S6 layer and a single Transformer layer, both with hidden dimension $N$. For input sequences of length $L \geq 3$, a single layer of Mamba is logarithmically more expressively efficient in depth compared to a single causal linear self-attention layer with a single head and polyn

Figures (3)

  • Figure 1: Expressivity via Polynomial Degree: Our characterization of SSMs, S6 layers, and causal self-attention via multivariate polynomials allows us to identify the expressiveness gap between these layers through maximal polynomial degree.
  • Figure 2: Visualization of 3-stacked Mamba layers expressing monomials of a univariate polynomial, as formulated in Lemma \ref{lemma:3layerMambaExpresivity}. To simplify the visualization, the Conv1D layer has been omitted.
  • Figure 3: Model justifications & ablations: The left panel shows top-1 accuracy for image classification on the ImageNet-100 benchmark, and the right panel shows perplexity for language modeling on WikiText-103. The y-axis represents the model's score across different epochs. In both panels, the blue curve represents the baseline, the yellow curve corresponds to Eq. \ref{eq:simplifiedModel}, the green curve illustrates Eq. \ref{eq:model}, and the red curve depicts the polynomial variant using standard discretization.
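
The degree-based comparison named in the Figure 1 caption can be illustrated with a toy calculation. The sketch below is not the paper's construction: it assumes scalar inputs, sets every learned weight to 1, and replaces the S6 layer with a hypothetical recurrence `h_t = x_t * h_{t-1} + x_t` whose transition coefficient is the input itself. The helper names (`var`, `add`, `mul`, `degree`, `toy_linear_attention`, `toy_selective_recurrence`, `degrees`) are invented for this illustration. It tracks each output symbolically as a multivariate polynomial in `x_1..x_L` and reports its maximal degree.

```python
from collections import defaultdict

# A polynomial is a dict mapping an exponent tuple (one entry per input
# x_1..x_L) to an integer coefficient. The empty dict is the zero polynomial.

def var(i, n):
    """The polynomial x_{i+1} in n variables."""
    exps = [0] * n
    exps[i] = 1
    return {tuple(exps): 1}

def add(p, q):
    out = defaultdict(int)
    for poly in (p, q):
        for mono, c in poly.items():
            out[mono] += c
    return {m: c for m, c in out.items() if c != 0}

def mul(p, q):
    out = defaultdict(int)
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            mono = tuple(a + b for a, b in zip(m1, m2))
            out[mono] += c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def degree(p):
    """Maximal total degree over all monomials (0 for the zero polynomial)."""
    return max((sum(m) for m in p), default=0)

def toy_linear_attention(xs):
    """Assumed toy: y_L = sum_{s<=L} (q_L * k_s) * v_s with q = k = v = x."""
    q_last = xs[-1]
    y = {}
    for x_s in xs:
        y = add(y, mul(mul(q_last, x_s), x_s))
    return y

def toy_selective_recurrence(xs):
    """Assumed toy: h_t = x_t * h_{t-1} + x_t, starting from h_0 = 0."""
    h = {}
    for x_t in xs:
        h = add(mul(x_t, h), x_t)
    return h

def degrees(L):
    """Return (attention degree, recurrence degree) for sequence length L."""
    xs = [var(i, L) for i in range(L)]
    return degree(toy_linear_attention(xs)), degree(toy_selective_recurrence(xs))

if __name__ == "__main__":
    for L in (3, 4, 6):
        print(L, degrees(L))
```

In this toy setting the attention-like sum stays at degree 3 for every `L`, while the input-dependent recurrence reaches degree `L` in a single pass. It only illustrates the kind of gap that a degree-based characterization can expose and makes no claim about the exact statements or constants in the paper's theorems.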

Theorems & Definitions (23)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • proof : Proof of Lemma \ref{lemma:dir2} (without Softmax)
  • Theorem 2
  • Lemma 3
  • Theorem 3
  • Definition 1
  • Lemma 1
  • proof : Proof of Lemma \ref{lemma:dir2appendix}
  • ...and 13 more