On Structured State-Space Duality

Jerry Yao-Chieh Hu, Xiwen Zhang, Ali ElSheikh, Weimin Wu, Han Liu

TL;DR

This work extends Structured State-Space Duality (SSD) beyond scalar-identity SSMs to general diagonal SSMs, showing that diagonal dynamics can be represented as sums of 1-SS masked-attention components while preserving the same $O(TN)$ training and inference complexity as prior SSD formulations. It provides a constructive, necessary-and-sufficient condition for when an SSM has a $1$-semiseparable masked-attention dual and proves that the SSD framework cannot extend to standard softmax attention due to rank explosion. The results broaden the theoretical link between recurrent SSMs and Transformer-like attention, offering new architectures that blend linear-time computation with richer dynamics, while clearly delineating the limits of SSD. Numerical experiments corroborate the equivalence between diagonal SSMs and sums of 1-SS attention heads, the rank behavior of semiseparable kernels, and the non-extensibility to softmax attention.

Abstract

Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence between a simple Structured State-Space Model (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to a masked self-attention with a $1$-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time $O(T)$ recurrence or as a quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar-identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case's training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten the bridge between recurrent SSMs and Transformers, and widen the design space for expressive yet efficient sequence models.
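
The two realizations in point (i) can be illustrated with a minimal sketch. It assumes the recurrence $h_t = a_t \odot h_{t-1} + b_t x_t$, $y_t = c_t^\top h_t$ with a diagonal state matrix stored as the vector $a_t$ and a single scalar input channel. The function names `ssm_recurrence`, `one_ss_mask`, and `sum_of_heads` are illustrative and are not taken from the paper.

```python
import random

def ssm_recurrence(a, b, c, x):
    """Linear-time form: h_t = a_t * h_{t-1} + b_t * x_t (elementwise), y_t = <c_t, h_t>."""
    T, N = len(a), len(a[0])
    h = [0.0] * N
    y = []
    for t in range(T):
        h = [a[t][n] * h[n] + b[t][n] * x[t] for n in range(N)]
        y.append(sum(c[t][n] * h[n] for n in range(N)))
    return y

def one_ss_mask(a_n):
    """Causal mask with L[j][i] = a_n[i+1] * ... * a_n[j] for j > i and L[j][j] = 1."""
    T = len(a_n)
    L = [[0.0] * T for _ in range(T)]
    for i in range(T):
        L[i][i] = 1.0
        prod = 1.0
        for j in range(i + 1, T):
            prod *= a_n[j]
            L[j][i] = prod
    return L

def sum_of_heads(a, b, c, x):
    """Quadratic-time form: one masked-attention head per state coordinate n, then summed."""
    T, N = len(a), len(a[0])
    y = [0.0] * T
    for n in range(N):
        L = one_ss_mask([a[t][n] for t in range(T)])
        for j in range(T):
            for i in range(j + 1):
                # Head n uses the rank-one score c_j[n] * b_i[n], masked by L[j][i].
                y[j] += L[j][i] * c[j][n] * b[i][n] * x[i]
    return y

if __name__ == "__main__":
    random.seed(0)
    T, N = 6, 3  # assumed toy sizes
    a = [[random.uniform(0.5, 1.0) for _ in range(N)] for _ in range(T)]
    b = [[random.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(T)]
    c = [[random.uniform(-1.0, 1.0) for _ in range(N)] for _ in range(T)]
    x = [random.uniform(-1.0, 1.0) for _ in range(T)]
    gap = max(abs(u - v) for u, v in zip(ssm_recurrence(a, b, c, x),
                                         sum_of_heads(a, b, c, x)))
    print("max abs difference:", gap)
```

In this sketch the recurrence costs $O(TN)$ operations while the explicit sum of heads costs $O(T^2 N)$; both compute the same outputs up to floating-point rounding.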

Paper Structure

This paper contains 86 sections, 8 theorems, 62 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Proposition 3.1

Consider the SSM defined by the recurrence (eqn:recurrence) where each $A^t = a_t I_N$ (i.e., a scalar-identity SSM). Let $B = [b_1; b_2; \dots; b_T]^\top \in \mathbb{R}^{T\times N}$ and $C = [c_1; c_2; \dots; c_T]^\top \in \mathbb{R}^{T\times N}$ be the matrices whose $t$-th rows are $b_t^\top$ and $c_t^\top$, respectively. […] Here $\odot$ denotes the elementwise (Hadamard) product. In particular, the same sequence transformation …
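
As a hedged reconstruction of the scalar-identity case, the following short derivation starts from the kernel formula given in the Figure 1 caption. It is not quoted from the paper: the mask symbol $L$ is our notation, and the convention that the empty product equals $1$ on the diagonal is an assumption.

```latex
% Kernel from the Figure 1 caption, specialized to A^t = a_t I_N:
M_{j,i} = c_j^\top A^j \cdots A^{i+1} b_i
        = (a_j a_{j-1} \cdots a_{i+1}) \, c_j^\top b_i , \qquad j \ge i .

% Collect the scalar products into a lower-triangular mask L (our notation):
L_{j,i} =
\begin{cases}
  a_j \cdots a_{i+1}, & j > i, \\
  1,                  & j = i, \\
  0,                  & j < i,
\end{cases}
\qquad
(C B^\top)_{j,i} = c_j^\top b_i .

% Hence the kernel factors as a masked attention matrix:
M = L \odot (C B^\top) .
```

When every $a_t \neq 0$, the mask satisfies $L_{j,i} = p_j / p_i$ for $j \ge i$ with $p_t = a_1 \cdots a_t$, so every block of $L$ taken from its lower-triangular part has rank at most one.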

Figures (9)

  • Figure 1: $M_{j,i} = c_j^\top A^j \cdots A^{i+1} b_i$
  • Figure 2: Construction of $b^n$ and $c^n$.
  • Figure 8: Wall-clock runtime vs. sequence length $T$ for the diagonal SSM implemented as a recurrence (gray, $O(T)$) and as explicit attention (red, $O(T^2)$). Curves show the mean over 100 runs; shaded regions denote $\pm$ one standard deviation. The recurrence scales linearly in $T$, while the attention implementation exhibits quadratic growth.
  • Figure 9: Experiment 5: numerical rank of the causal softmax attention matrix as a function of sequence length $T$. The solid curve shows the mean rank over random draws of $(Q,K)$ and the dashed line is the full-rank baseline $y=T$. The rank grows with $T$ and remains very close to full rank for all sequence lengths considered, indicating that generic softmax attention is effectively full rank.
  • Figure 10: Experiment 5: rank gap $T - \mathrm{rank}(A)$ for the same softmax attention matrices as in Figure 9. The gap grows slowly with $T$ but remains tiny compared to $T$ (on the order of $10^2$ even when $T \approx 5{,}000$), highlighting that generic softmax attention is effectively numerically full rank and does not admit a fixed low semiseparable rank.
  • ...and 4 more figures
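
The contrast between semiseparable masks and softmax attention in the captions above can be probed at toy scale. The sketch below is not the paper's Experiment 5, which reports the rank of the full attention matrix. It instead compares the numerical rank of the block strictly below the diagonal for a 1-SS mask and for a causal softmax attention matrix. The sizes, the tolerance, the Gaussian draws of $(Q, K)$, and all function names are assumptions made for illustration.

```python
import math
import random

def numerical_rank(M, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting and a relative tolerance."""
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    scale = max(abs(v) for row in A for v in row) or 1.0
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) <= tol * scale:
            continue  # no usable pivot in this column
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, rows):
            f = A[r][col] / A[rank][col]
            for k in range(col, cols):
                A[r][k] -= f * A[rank][k]
        rank += 1
    return rank

def one_ss_mask(a):
    """Causal mask with L[j][i] = a[i+1] * ... * a[j] for j > i and L[j][j] = 1."""
    T = len(a)
    L = [[0.0] * T for _ in range(T)]
    for i in range(T):
        L[i][i] = 1.0
        prod = 1.0
        for j in range(i + 1, T):
            prod *= a[j]
            L[j][i] = prod
    return L

def causal_softmax_attention(Q, K):
    """Row-wise softmax of q_j . k_i over the causal positions i <= j."""
    T = len(Q)
    S = [[0.0] * T for _ in range(T)]
    for j in range(T):
        scores = [sum(q * k for q, k in zip(Q[j], K[i])) for i in range(j + 1)]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        for i in range(j + 1):
            S[j][i] = w[i] / z
    return S

def lower_left_block(M):
    """Rows T//2..T-1 and columns 0..T//2-1: a block strictly below the diagonal."""
    h = len(M) // 2
    return [row[:h] for row in M[h:]]

if __name__ == "__main__":
    random.seed(1)
    T, d = 8, 4  # assumed toy sizes
    a = [random.uniform(0.5, 1.0) for _ in range(T)]
    Q = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
    K = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]
    print("1-SS block rank:", numerical_rank(lower_left_block(one_ss_mask(a))))
    print("softmax block rank:",
          numerical_rank(lower_left_block(causal_softmax_attention(Q, K))))
```

The 1-SS block has rank one by construction, while the softmax block is generically of higher rank because $\exp(q_j^\top k_i)$ does not factor into a product of a row term and a column term.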

Theorems & Definitions (31)

  • Definition 3.1: $N$-Semiseparable ($N$-SS) Matrix
  • Definition 3.2: $1$-Semiseparable ($1$-SS) Matrix
  • Remark 3.1: Prior Work
  • Definition 3.3: $1$-SS Masked Attention
  • Definition 3.4: $N$-Sequentially Semiseparable ($N$-SSS) Representation
  • Definition 3.5: $N$-SSS Representable Matrix
  • Proposition 3.1: Scalar-Identity State-Space Duality [Dao & Gu, ICML 2024]
  • Remark 4.1: An Example of "Richer Dynamics" of Diagonal SSM
  • Remark 4.2
  • Remark 4.3
  • ...and 21 more