Efficient State Space Model via Fast Tensor Convolution and Block Diagonalization

Tongyi Liang, Han-Xiong Li

TL;DR

Efficient long-sequence modeling remains challenging because of parameter and compute bottlenecks. The authors propose eSSM, a MIMO SSM layer that combines three techniques: diagonalization of the system matrix to decouple the dynamics, an FFT-based fast tensor convolution to accelerate computation, and block diagonalization (multi-head) to reduce parameters. A bidirectional, non-causal variant is also offered. Across extensive benchmarks, eSSM matches or exceeds state-of-the-art SSMs such as S4/S5 and often outperforms Transformer and LSTM baselines. In the efficiency benchmark it uses about 13% of the parameters of LSTM or Mamba and trains 3.94 times faster than LSTM and 1.35 times faster than Mamba. The work provides a scalable approach to long-sequence modeling and suggests that control-theory-inspired structures integrate well into neural network design. The combination of HiPPO-inspired initialization, diagonalized dynamics, and FFT-based computation offers practical improvements for real-world sequence tasks such as language, speech, and vision.

Abstract

Existing models encounter bottlenecks in balancing performance and computational efficiency when modeling long sequences. Although the state space model (SSM) has achieved remarkable success in handling long sequence tasks, it still suffers from a large number of parameters. To further improve the efficiency of SSM, we propose a new state space layer based on multiple-input multiple-output SSM, called efficient SSM (eSSM). Our eSSM is built on the convolutional representation of the multi-input and multi-output (MIMO) SSM. We propose a variety of effective strategies to improve the computational efficiency. The diagonalization of the system matrix first decouples the original system. Then a fast tensor convolution is proposed based on the fast Fourier transform. In addition, the block diagonalization of the SSM further reduces the model parameters and improves the model flexibility. Extensive experimental results show that the performance of the proposed model on multiple databases matches the performance of state-of-the-art models, such as S4, and is significantly better than Transformers and LSTM. In the model efficiency benchmark, the parameters of eSSM are only 12.89% of LSTM and 13.24% of Mamba. The training speed of eSSM is 3.94 times faster than LSTM and 1.35 times faster than Mamba. Code is available at: https://github.com/leonty1/essm.
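The identity behind the diagonalization and FFT steps in the abstract can be checked numerically with a small sketch. It is not the authors' implementation: the function names, the convention x_k = Λ x_{k-1} + B u_k with zero initial state and no feedthrough term, and the pure-Python radix-2 FFT (standing in for an optimized library FFT) are all assumptions made for illustration. It builds the kernel K_j = C Λ^j B of a diagonal MIMO SSM and compares the FFT-based convolution with the step-by-step recurrence.

```python
import cmath
import math

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def ifft(a):
    """Inverse FFT: unnormalized inverse transform divided by the length."""
    n = len(a)
    return [v / n for v in fft(a, invert=True)]

def causal_conv_fft(kernel, signal):
    """y[k] = sum_j kernel[j] * signal[k - j], via zero-padded FFT."""
    L = len(signal)
    n = 1
    while n < 2 * L:  # pad so the circular convolution does not wrap around
        n *= 2
    Kf = fft(list(kernel) + [0j] * (n - L))
    Uf = fft([complex(v) for v in signal] + [0j] * (n - L))
    return ifft([a * b for a, b in zip(Kf, Uf)])[:L]

def ssm_kernel(lam, B, C, L):
    """K[j][o][i] = sum_n C[o][n] * lam[n]**j * B[n][i] (diagonal system)."""
    N, m, p = len(lam), len(B[0]), len(C)
    return [[[sum(C[o][n] * lam[n] ** j * B[n][i] for n in range(N))
              for i in range(m)] for o in range(p)] for j in range(L)]

def ssm_fft(lam, B, C, u):
    """Output of the diagonal MIMO SSM computed as a causal convolution."""
    L, m, p = len(u), len(B[0]), len(C)
    K = ssm_kernel(lam, B, C, L)
    y = [[0.0] * p for _ in range(L)]
    for o in range(p):
        for i in range(m):
            k_oi = [K[j][o][i] for j in range(L)]
            u_i = [u[k][i] for k in range(L)]
            for k, v in enumerate(causal_conv_fft(k_oi, u_i)):
                y[k][o] += v.real
    return y

def ssm_recurrence(lam, B, C, u):
    """Reference: x_k = lam * x_{k-1} + B u_k, y_k = Re(C x_k), x_{-1} = 0."""
    N, m, p = len(lam), len(B[0]), len(C)
    x = [0j] * N
    ys = []
    for u_k in u:
        x = [lam[n] * x[n] + sum(B[n][i] * u_k[i] for i in range(m))
             for n in range(N)]
        ys.append([sum(C[o][n] * x[n] for n in range(N)).real for o in range(p)])
    return ys

# Illustrative values only: 3 states, 2 inputs, 2 outputs, 10 time steps.
lam = [0.9 * cmath.exp(0.3j), 0.9 * cmath.exp(-0.3j), 0.5]
B = [[1.0, 0.5], [1.0, -0.5], [0.2, 0.3]]
C = [[0.5, 0.5, 1.0], [1.0, -1.0, 0.0]]
u = [[math.sin(0.5 * k), math.cos(0.3 * k)] for k in range(10)]

y_rec = ssm_recurrence(lam, B, C, u)
y_fft = ssm_fft(lam, B, C, u)
print(max(abs(a - b) for r, f in zip(y_rec, y_fft) for a, b in zip(r, f)))
```

The sketch runs one FFT convolution per input-output pair for readability. A tensorized version would transform each input channel once and contract over channels in the frequency domain, which is presumably closer to what the fast tensor convolution does.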

Paper Structure (74 sections, 4 theorems, 34 equations, 11 figures, 10 tables)

Key Result

Proposition 1

If all eigenvalues of $A$ have negative real parts, then for any given initial state $\hat{x}_0$, the estimated state $\hat{x}_{k}$ converges to the true state $x_k$ over time.
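A small numerical sketch illustrates why the proposition is plausible. It assumes a diagonal continuous-time system discretized with zero-order hold at step size `dt`; the discretization choice, the eigenvalues, the input signal, and all names are hypothetical, not taken from the paper. Two copies of the system receive the same input but start from different initial states. Their difference obeys e_{k+1} = exp(dt·λ) e_k, which shrinks whenever Re(λ) < 0.

```python
import cmath
import math

def simulate(lam_bar, b_bar, u, x0):
    """Run x_{k+1} = lam_bar * x_k + b_bar * u_k element-wise; return all states."""
    x = list(x0)
    traj = []
    for u_k in u:
        x = [l * xi + b * u_k for l, xi, b in zip(lam_bar, x, b_bar)]
        traj.append(x)
    return traj

# Hypothetical stable continuous-time eigenvalues (all real parts negative).
lam = [-0.5 + 2.0j, -0.5 - 2.0j, -1.0 + 0.0j]
dt = 0.1

# Zero-order-hold discretization of a diagonal system with B = 1 (an assumption).
lam_bar = [cmath.exp(dt * l) for l in lam]
b_bar = [(lb - 1) / l for lb, l in zip(lam_bar, lam)]

u = [math.sin(0.2 * k) for k in range(200)]
true_traj = simulate(lam_bar, b_bar, u, [0j, 0j, 0j])
biased_traj = simulate(lam_bar, b_bar, u, [1 + 1j, -2 + 0j, 0.5j])

# Largest per-state gap between the biased and the true trajectory at each step.
err = [max(abs(a - b) for a, b in zip(t, e))
       for t, e in zip(true_traj, biased_traj)]
print(err[0], err[-1])
```

Because the gap depends only on the initial mismatch and the discretized eigenvalues, the input signal does not affect how fast the two trajectories merge.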

Figures (11)

  • Figure 1: Efficiency of eSSM and baseline methods for sequence modeling: (a) number of model parameters vs. input size; (b) training time per epoch vs. input size; (c) training time per epoch vs. input sequence length. Our eSSM is the most efficient model, outperforming Transformer, LSTM, and other SSM-based models.
  • Figure 2: The internal structure of state space models (SSMs) with multi-input and multi-output (MIMO) (a) and Multi-Head eSSM (b).
  • Figure 3: Architecture of deep eSSM.
  • Figure 4: Model transformation equivalence. The blue dashed line represents the original discrete SSM (Eq. \ref{discret_ssm}), the blue asterisks represent the diagonal SSM (Eq. \ref{diagonal_ssm}), the red solid line labeled ‘Conv’ denotes the convolution SSM computed directly by convolution (Eq. \ref{convolution_ssm}), and the circle labeled ‘FFT’ corresponds to the convolution SSM computed via fast tensor convolution (Eq. \ref{convolution_fft}). All models yield identical results.
  • Figure 5: State convergence. The convolutional SSM with a biased initial state (red line, labeled 'Biased eSSM') converges over time to the true state of the original system (blue line).
  • ...and 6 more figures
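The parameter saving from the multi-head (block-diagonal) structure in Figure 2(b) can be illustrated with a rough count. The sketch below rests on loud assumptions: a diagonal state matrix, states, inputs, and outputs split evenly across heads, and only Λ, B, and C counted. The paper's actual partitioning and any extra parameters (feedthrough, step size) may differ, and the function names are hypothetical.

```python
def mimo_params(n_state, n_in, n_out):
    """Parameters of one diagonal MIMO SSM: Lambda (diagonal), B, and C."""
    return n_state + n_state * n_in + n_out * n_state

def multihead_params(n_state, n_in, n_out, heads):
    """Block-diagonal variant: `heads` independent, smaller MIMO SSMs (assumed split)."""
    if n_state % heads or n_in % heads or n_out % heads:
        raise ValueError("dimensions must be divisible by the number of heads")
    return heads * mimo_params(n_state // heads, n_in // heads, n_out // heads)

full = mimo_params(64, 128, 128)
split = multihead_params(64, 128, 128, heads=8)
print(full, split, split / full)
```

Under these assumptions the B and C blocks shrink by roughly a factor equal to the number of heads, since each head connects only its own slice of the inputs and outputs to its own slice of the state.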

Theorems & Definitions (7)

  • Proposition 1: State Convergence
  • proof
  • Lemma 1: Diagonalization Equivalence
  • Proposition 2: Efficient Convolution with Diagonalization
  • proof
  • Proposition 3
  • proof
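
The statement of Lemma 1 is not reproduced on this page. The standard change-of-basis argument that a diagonalization equivalence usually rests on is sketched below. It assumes that $A$ is diagonalizable as $A = V \Lambda V^{-1}$ and uses a discrete system $x_{k+1} = A x_k + B u_k$, $y_k = C x_k$; the symbols $V$, $\Lambda$, $\tilde{x}_k$, $B$, $C$, $u_k$, and $y_k$ are assumed here, and the paper's own notation may differ.

```latex
\begin{aligned}
\tilde{x}_k &:= V^{-1} x_k
  &&\Longrightarrow\quad
  \tilde{x}_{k+1} = V^{-1} A V \,\tilde{x}_k + V^{-1} B\, u_k
                  = \Lambda \tilde{x}_k + \tilde{B} u_k,
  \qquad \tilde{B} := V^{-1} B, \\
y_k &= C x_k = C V \tilde{x}_k = \tilde{C} \tilde{x}_k,
  &&\qquad \tilde{C} := C V, \\
C A^{j} B &= C \left(V \Lambda V^{-1}\right)^{j} B
           = (C V)\, \Lambda^{j}\, (V^{-1} B)
           = \tilde{C} \Lambda^{j} \tilde{B}.
\end{aligned}
```

The last line shows that the original and the diagonal system share the same convolution kernel, so they produce identical outputs for the same input. Each kernel entry then reduces to a sum of scalar powers $\lambda_n^{j}$ instead of requiring matrix powers.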