
State-Free Inference of State-Space Models: The Transfer Function Approach

Rom N. Parnichkun, Stefano Massaroli, Alessandro Moro, Jimmy T. H. Smith, Ramin Hasani, Mathias Lechner, Qi An, Christopher Ré, Hajime Asama, Stefano Ermon, Taiji Suzuki, Atsushi Yamashita, Michael Poli

TL;DR

The paper addresses the memory and compute cost of state-space models (SSMs) for sequence modeling by parametrizing them through their rational transfer function (RTF) representation. It introduces a state-free parallel inference algorithm that computes the impulse-response spectrum via a single FFT, achieving $O(\ell)$ space and $O(\ell \log \ell)$ time in the sequence length $\ell$, with no significant added cost as the state size grows. On Long Range Arena, RTF trains on average 35% faster than S4 layers while delivering state-of-the-art downstream performance among attention-free approaches, and it improves perplexity on WikiText103 when the parametrization is integrated into Hyena (Hyena-RTF). Stability is addressed through initialization and an analysis of coefficient constraints. These results suggest that RTF enables scalable, expressive, and efficient linear time-invariant sequence processing, with practical implications for fast autoregressive inference and large-state models.

Abstract

We approach designing a state-space model for deep learning applications through its dual representation, the transfer function, and uncover a highly efficient sequence parallel inference algorithm that is state-free: unlike other proposed algorithms, state-free inference does not incur any significant memory or computational cost with an increase in state size. We achieve this using properties of the proposed frequency-domain transfer function parametrization, which enables direct computation of its corresponding convolutional kernel's spectrum via a single Fast Fourier Transform. Our experimental results across multiple sequence lengths and state sizes illustrate, on average, a 35% training speed improvement over S4 layers (parametrized in the time domain) on the Long Range Arena benchmark, while delivering state-of-the-art downstream performance among attention-free approaches. Moreover, we report improved perplexity in language modeling over a long convolutional Hyena baseline, simply by introducing our transfer function parametrization. Our code is available at https://github.com/ruke1ire/RTF.
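The frequency-domain computation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the coefficient convention $H(z) = h_0 + \frac{b_1 z^{-1} + \dots + b_n z^{-n}}{1 + a_1 z^{-1} + \dots + a_n z^{-n}}$ is an assumption. Sampling $H$ on the $\ell$ roots of unity yields the impulse response wrapped modulo $\ell$, which matches the first $\ell$ taps only up to the tail that has decayed by step $\ell$, so the sketch assumes a stable, well-decayed system.

```python
import numpy as np


def rtf_kernel_fft(b, a, h0, length):
    """Impulse response of H(z) = h0 + B(z)/A(z) from FFTs of padded coefficients.

    Assumed convention (not taken from the source):
    B(z) = b_1 z^-1 + ... + b_n z^-n, A(z) = 1 + a_1 z^-1 + ... + a_n z^-n,
    with length > n. Buffers have size O(length) regardless of n.
    """
    n = len(a)
    num = np.zeros(length)
    den = np.zeros(length)
    num[1:n + 1] = b   # (0, b_1, ..., b_n, 0, ..., 0)
    den[0] = 1.0
    den[1:n + 1] = a   # (1, a_1, ..., a_n, 0, ..., 0)
    # One batched FFT evaluates both polynomials on the roots of unity.
    num_f, den_f = np.fft.rfft(np.stack([num, den]), axis=-1)
    spectrum = num_f / den_f + h0
    # The inverse transform returns the impulse response wrapped modulo `length`.
    return np.fft.irfft(spectrum, n=length)


def rtf_kernel_recurrence(b, a, h0, length):
    """Reference impulse response from the difference equation (O(length * n))."""
    n = len(a)
    w = np.zeros(length)
    for t in range(length):
        acc = b[t - 1] if 1 <= t <= n else 0.0
        for i in range(1, min(n, t) + 1):
            acc -= a[i - 1] * w[t - i]
        w[t] = acc
    h = w.copy()
    h[0] += h0  # the feedforward term only affects the first tap
    return h


def causal_conv_fft(u, h):
    """Causal convolution y_t = sum_{s<=t} h_s u_{t-s}, zero-padded to avoid wrap-around."""
    length = len(u)
    u_f = np.fft.rfft(u, n=2 * length)
    h_f = np.fft.rfft(h, n=2 * length)
    return np.fft.irfft(u_f * h_f, n=2 * length)[:length]
```

For a denominator built from poles well inside the unit circle, for example `a = np.poly([0.5, -0.3, 0.6])[1:]`, the two kernel functions should agree to near machine precision at `length=64`; the FFT route never materializes an $n$-dimensional state.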

Paper Structure

This paper contains 49 sections, 6 theorems, 60 equations, 5 figures, 12 tables, and 1 algorithm.

Key Result

Lemma 3.1

The transfer function coefficients $a, b$ are invariant under any invertible change of state variables.
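A small numerical check can make this invariance concrete. The sketch below reads "change of variables" as a change of state coordinates $x \mapsto Tx$ with $T$ invertible, which maps $(A, B, C)$ to $(TAT^{-1}, TB, CT^{-1})$. The helper name `transfer_coefficients` is hypothetical, and the single-input single-output setup is an assumption made for brevity.

```python
import numpy as np


def transfer_coefficients(A, B, C):
    """Coefficients (a, b) of C (zI - A)^{-1} B for a SISO system.

    The denominator z^n + a_1 z^{n-1} + ... + a_n is the characteristic
    polynomial of A; the numerator b_1 z^{n-1} + ... + b_n follows from
    multiplying the denominator by the Markov parameters h_k = C A^{k-1} B.
    """
    n = A.shape[0]
    a_full = np.poly(A)  # leading coefficient is 1
    markov = []
    x = B.copy()
    for _ in range(n):
        markov.append((C @ x).item())
        x = A @ x
    # b_k = sum_{j=0}^{k-1} a_j h_{k-j}, with a_0 = 1 and markov[i] = h_{i+1}
    b = np.array([
        sum(a_full[j] * markov[k - 1 - j] for j in range(k))
        for k in range(1, n + 1)
    ])
    return np.real(a_full[1:]), b


rng = np.random.default_rng(0)
n = 4
A = 0.5 * rng.normal(size=(n, n))
B = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))

# A well-conditioned invertible change of coordinates (arbitrary choice).
T = rng.normal(size=(n, n)) + 4.0 * np.eye(n)
T_inv = np.linalg.inv(T)

a1, b1 = transfer_coefficients(A, B, C)
a2, b2 = transfer_coefficients(T @ A @ T_inv, T @ B, C @ T_inv)
print(np.max(np.abs(a1 - a2)), np.max(np.abs(b1 - b2)))
```

Both differences should be at the level of floating-point round-off: the state-space matrices change with $T$, while the $2n$ transfer function coefficients do not.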

Figures (5)

  • Figure 1: Scaling of memory consumption for a scan-based algorithm (S5) and the proposed state-free inference algorithm (RTF). With larger state sizes, inference with S5 becomes prohibitively memory-intensive.
  • Figure 2: (a) The rational transfer function (RTF) representation comprises numerator and denominator polynomial coefficients $\textbf{b}$ and $\textbf{a}$, and the feedforward term $h_0$. (b) illustrates the proposed state-free parallel inference algorithm. The key to efficient state-free inference lies in casting $\textbf{b}$ and $\textbf{a}$ onto the sequence length for computing the convolutional filter $(h_i)_{i \in [\ell]}$. (c) illustrates the recurrent form of RTF which can be used for fast single-step inference. Here we denote the $i$-th state at time $t$ as $x_t^{i}$.
  • Figure 3: Latency profiles for a single RTF, S4D, and S4 layer at various state sizes. RTF consistently shows lower parallel inference latency across tasks and state sizes.
  • Figure 4: The space of stable roots of a 2nd order polynomial with conjugate roots is illustrated with a green-blue colormap. The figure on the right overlays the space of coefficients that obey Montel's constraints in pink.
  • Figure 5: Scaling of parallel inference latency for S5 and RTF across sequence lengths and state sizes. At equal expansion factors, RTF provides lower latency across sequence lengths.
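The relationship shown in Figure 4 between stable roots and Montel's constraints can be probed numerically. The caption does not spell out the constraint, so the sketch below assumes the common form $\sum_i |a_i| < 1$ for the monic polynomial $z^n + a_1 z^{n-1} + \dots + a_n$; the function names are hypothetical. Under that assumption the bound is sufficient but not necessary for stability: if $|z| \ge 1$ were a root, then $|z|^n \le \sum_i |a_i| \, |z|^{n-i} < |z|^{n-1}$, a contradiction.

```python
import numpy as np


def is_stable(a):
    """True if all roots of z^n + a_1 z^{n-1} + ... + a_n lie strictly inside the unit circle."""
    roots = np.roots(np.concatenate(([1.0], a)))
    return bool(np.all(np.abs(roots) < 1.0))


def satisfies_l1_bound(a):
    """Assumed form of Montel's constraint: the coefficient l1 norm is below 1."""
    return bool(np.sum(np.abs(a)) < 1.0)


# Sample 2nd-order denominators and confirm the bound implies stability.
rng = np.random.default_rng(0)
samples = rng.uniform(-2.0, 2.0, size=(2000, 2))
inside = [a for a in samples if satisfies_l1_bound(a)]
all_inside_stable = all(is_stable(a) for a in inside)

# A stable polynomial that violates the bound: roots 0.6 +/- 0.374j, modulus about 0.707.
counterexample = np.array([-1.2, 0.5])
print(all_inside_stable, is_stable(counterexample), satisfies_l1_bound(counterexample))
```

The counterexample is consistent with the figure's description of the constrained coefficients as an overlay on the larger stable region.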

Theorems & Definitions (9)

  • Lemma 3.1
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • proof
  • Lemma 1.1 (Sandberg, 1963)
  • Lemma 2.1