Spectral State Space Models

Naman Agarwal, Daniel Suo, Xinyi Chen, Elad Hazan

TL;DR

The paper tackles long-range sequence prediction by introducing spectral state space models, which use fixed spectral filters to capture long-range dependencies and whose performance depends on neither the spectral gap of the underlying dynamics nor the dimensionality of the problem. It develops the Spectral Transform Unit (STU) and its stacked variants, which combine a fixed spectral component with a small autoregressive component to achieve stable, memory-efficient learning. The authors provide theoretical guarantees connecting spectral filtering to the expressive capacity of linear dynamical systems and validate the approach on synthetic data and the Long Range Arena benchmark, reporting robustness and competitive performance without specialized initialization or normalization. Overall, the work offers a principled, computationally efficient alternative to Transformers for long-context modeling, with strong stability properties and applicability across modalities.

Abstract

This paper studies sequence modeling for prediction tasks with long range dependencies. We propose a new formulation for state space models (SSMs) based on learning linear dynamical systems with the spectral filtering algorithm (Hazan et al. (2017)). This gives rise to a novel sequence prediction architecture we call a spectral state space model. Spectral state space models have two primary advantages. First, they have provable robustness properties as their performance depends on neither the spectrum of the underlying dynamics nor the dimensionality of the problem. Second, these models are constructed with fixed convolutional filters that do not require learning while still outperforming SSMs in both theory and practice. The resulting models are evaluated on synthetic dynamical systems and long-range prediction tasks of various modalities. These evaluations support the theoretical benefits of spectral filtering for tasks requiring very long range memory.
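
To illustrate what "fixed convolutional filters that do not require learning" can look like, the sketch below computes spectral filters as the top eigenvectors of a fixed Hankel matrix. The entry formula $Z_{ij} = 2/((i+j)^3 - (i+j))$ follows the spectral filtering construction of Hazan et al. (2017) and is an assumption here, since this page does not state it. The function name `spectral_filters` and its arguments are hypothetical.

```python
import numpy as np

def spectral_filters(seq_len: int, num_filters: int):
    """Return the top `num_filters` eigenpairs of a fixed Hankel matrix.

    Assumption: Z[i, j] = 2 / ((i + j)^3 - (i + j)) with 1-indexed i, j,
    as in the spectral filtering construction of Hazan et al. (2017).
    """
    idx = np.arange(1, seq_len + 1, dtype=np.float64)
    s = idx[:, None] + idx[None, :]        # s[i, j] = i + j
    Z = 2.0 / (s**3 - s)                   # symmetric, depends only on seq_len

    eigvals, eigvecs = np.linalg.eigh(Z)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:num_filters]
    return eigvals[order], eigvecs[:, order]   # shapes (K,) and (L, K)

sigma, phi = spectral_filters(seq_len=64, num_filters=4)
print(phi.shape)   # one column per filter, indexed by time along the rows
```

Because the matrix depends only on the sequence length, the filters can be computed once and reused; nothing in this step is trained.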

Paper Structure

This paper contains 36 sections, 13 theorems, 74 equations, 6 figures, 2 tables.

Key Result

Theorem 2.1

Given any $A,B,C,D$ such that $A$ is a PSD matrix with $\|A\| \leq 1$ and given any numbers $K \in \mathbb{I}^+, a \in {\mathbb R}^+$, there exist matrices $M^u_1, M^u_2, M^{\phi}_1,\ldots,M^{\phi}_K$ such that for all $L$ and all sequences $u_{1:L}$ satisfying $\|u_t\| \leq a$ for all $t \in [L]$, [...] where $c \leq 10^6$ is a universal constant and $\|B\|_{\text{col}}$, $\|C\|_{\text{col}}$ are the [...]
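
The setting of Theorem 2.1 can be made concrete with a small simulation. The sketch below builds a PSD matrix $A$ with $\|A\| \leq 1$, draws inputs with $\|u_t\| \leq a$, and rolls out a linear dynamical system. The recurrence $x_t = A x_{t-1} + B u_t$, $y_t = C x_t + D u_t$ is the standard LDS form and is an assumption here, as are all function names and dimensions; the page does not spell them out.

```python
import numpy as np

def random_psd_contraction(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Random symmetric PSD matrix with eigenvalues in [0, 1], so ||A|| <= 1."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    eig = rng.uniform(0.0, 1.0, size=dim)
    return (q * eig) @ q.T                 # Q diag(eig) Q^T

def simulate_lds(A, B, C, D, u):
    """Roll out x_t = A x_{t-1} + B u_t, y_t = C x_t + D u_t from x_0 = 0 (assumed form)."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B @ u_t
        ys.append(C @ x + D @ u_t)
    return np.stack(ys)

rng = np.random.default_rng(0)
d_hidden, d_in, d_out, L, a = 8, 3, 2, 50, 1.0
A = random_psd_contraction(d_hidden, rng)
B = rng.standard_normal((d_hidden, d_in))
C = rng.standard_normal((d_out, d_hidden))
D = rng.standard_normal((d_out, d_in))

u = rng.standard_normal((L, d_in))
norms = np.linalg.norm(u, axis=1, keepdims=True)
u = u * np.minimum(1.0, a / norms)         # enforce ||u_t|| <= a

y = simulate_lds(A, B, C, D, u)            # shape (L, d_out)
```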

Figures (6)

  • Figure 1: Spectral Filters used by the Spectral Filtering Algorithm. The x-axis is the time domain.
  • Figure 2: Schematic showing the spectral projection of a 1-dimensional input sequence and how these features are used to produce the spectral component in the STU output (Equation \ref{eqn:SFmain}). In the multi-dimensional case, the operation is applied in parallel across every input dimension.
  • Figure 3: Learning dynamics for learning a marginally stable LDS. (a) Smoothed learning curves for a single STU layer (red) vs. a single LRU layer (black). The learning rate was tuned for both models. See the Appendix for a detailed discussion of the tuning and of the sensitivity to hyperparameters for both models. Curiously, at stable learning rates, LRUs show a plateau in learning. (b) Error (log scale) obtained by a single STU layer as a function of the model parameter $K$. The reconstruction loss drops exponentially, as predicted by the analysis.
  • Figure 4: Smoothed learning curves for learning a marginally stable LDS with a single STU layer (dashed) vs. a single LRU layer (solid). Different colors represent different learning rates, highlighting that training quickly becomes unstable for LRUs as the learning rate increases, while the STU trains at much higher learning rates. Curiously, at stable learning rates, LRUs show a plateau in learning for a large fraction of the training time.
  • Figure 5: LRU hyperparameter search vs. STU. The gray curves represent all the LRU hyperparameter settings that were tried. The STU curve is the best one from Figure \ref{fig:training_loss_synth_lr_curves}. For LRU, the search covered enabling stable exp-parameterization, gamma-normalization, ring-initialization, and phase-initialization, as well as the learning rate, weight decay, and a constant vs. warmup+cosine decay learning rate schedule.
  • ...and 1 more figure
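
The spectral projection described in the Figure 2 caption can be sketched as a causal convolution of a 1-dimensional input with each fixed filter. This is a minimal sketch under stated assumptions: the Hankel matrix entries $2/((i+j)^3 - (i+j))$ follow Hazan et al. (2017), the indexing convention (feature $t$ uses $u_t, u_{t-1}, \ldots$) is chosen for illustration, any scaling of the features is omitted, and the function names are hypothetical.

```python
import numpy as np

def hankel_filters(seq_len: int, num_filters: int) -> np.ndarray:
    """Top eigenvectors of the fixed Hankel matrix (assumed form), shape (L, K)."""
    i = np.arange(1, seq_len + 1, dtype=np.float64)
    s = i[:, None] + i[None, :]
    _, vecs = np.linalg.eigh(2.0 / (s**3 - s))   # ascending eigenvalue order
    return vecs[:, ::-1][:, :num_filters]        # largest eigenvalues first

def spectral_features(u: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Causal convolution: features[t, k] = sum_i filters[i, k] * u[t - i]."""
    seq_len = u.shape[0]
    cols = [np.convolve(u, filters[:, k])[:seq_len]
            for k in range(filters.shape[1])]
    return np.stack(cols, axis=1)                # shape (L, K)

# A unit impulse at t = 0 reproduces each filter as its feature column.
u = np.zeros(32)
u[0] = 1.0
phi = hankel_filters(32, 4)
feats = spectral_features(u, phi)
```

For multi-dimensional inputs, the caption indicates that the same operation is applied to every input dimension in parallel; in this sketch that would amount to calling `spectral_features` once per dimension.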

Theorems & Definitions (19)

  • Theorem 2.1
  • Theorem 3.1
  • Remark 3.2
  • Theorem 5.1
  • Lemma C.1
  • Lemma C.2
  • Lemma C.3
  • Lemma C.4 (Lemma E.3 of Hazan et al. (2017))
  • Proof of Lemma \ref{lem:hankel-entries}
  • Proof of Lemma \ref{lem:mu-props}
  • ...and 9 more