PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers

Eshed Gal, Moshe Eliasof, Siddharth Rout, Eldad Haber

Abstract

The success of vision transformers, especially for generative modeling, is limited by the quadratic cost and weak spatial inductive bias of self-attention. We propose PDE-SSM, a spatial state-space block that replaces attention with a learnable convection-diffusion-reaction partial differential equation. This operator encodes a strong spatial prior by modeling information flow via physically grounded dynamics rather than all-to-all token interactions. Solving the PDE in the Fourier domain yields global coupling with near-linear complexity of $O(N \log N)$, delivering a principled and scalable alternative to attention. We integrate PDE-SSM into a flow-matching generative model to obtain the PDE-based Diffusion Transformer, PDE-SSM-DiT. Empirically, PDE-SSM-DiT matches or exceeds the performance of state-of-the-art Diffusion Transformers while substantially reducing compute. Our results show that, analogous to 1D settings where SSMs supplant attention, multi-dimensional PDE operators provide an efficient, inductive-bias-rich foundation for next-generation vision models.
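To make the Fourier-domain claim concrete, the following is a minimal NumPy sketch of one spectral solve of a convection-diffusion-reaction equation, $u_t = \nabla \cdot (D \nabla u) - v \cdot \nabla u - r u$, on a periodic grid. It is not the authors' implementation: the constant coefficients, the periodic boundary, the single-channel input, and the names `cdr_spectral_step`, `diffusion`, `velocity`, and `reaction` are all assumptions made for illustration, and the paper's learnable parameterization may differ. The cost is two FFTs plus a pointwise multiply, which is where the $O(N \log N)$ scaling comes from.

```python
import numpy as np

def cdr_spectral_step(u, diffusion, velocity, reaction, t=1.0):
    """Advance u_t = div(D grad u) - v . grad u - r u by time t.

    Hypothetical helper (not from the paper). Assumes a periodic 2D grid,
    unit pixel spacing, and coefficients that are constant in space.
    """
    h, w = u.shape
    # Angular wavenumbers in radians per pixel for each axis.
    ky = 2.0 * np.pi * np.fft.fftfreq(h)
    kx = 2.0 * np.pi * np.fft.fftfreq(w)
    KY, KX = np.meshgrid(ky, kx, indexing="ij")

    (dyy, dyx), (dxy, dxx) = diffusion  # 2x2 diffusion tensor D
    vy, vx = velocity                   # convection velocity v
    # Fourier symbol of the operator: -k^T D k - i v.k - r
    diff_term = dyy * KY**2 + (dyx + dxy) * KY * KX + dxx * KX**2
    symbol = -diff_term - 1j * (vy * KY + vx * KX) - reaction

    # Each Fourier mode evolves independently: u_hat(t) = exp(t * symbol) * u_hat(0).
    u_hat = np.fft.fft2(u)
    return np.fft.ifft2(np.exp(t * symbol) * u_hat).real

# The impulse response is the convolution kernel the operator applies.
impulse = np.zeros((16, 16))
impulse[8, 8] = 1.0
# Anisotropic diffusion blurs more along y than x; convection shifts by 3 rows.
kernel = cdr_spectral_step(impulse, ((2.0, 0.0), (0.0, 0.2)), (3.0, 0.0), 0.0)
```

In this sketch the diffusion term controls the spread of the kernel, the convection term translates it, and the reaction term scales its total mass, so every output pixel depends on every input pixel without forming an explicit all-to-all interaction matrix.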

Paper Structure (39 sections, 13 equations, 8 figures, 10 tables, 1 algorithm)

Figures (8)

  • Figure 1: Visualizing the PDE-SSM Convolutional Kernels. By sampling the learnable parameters $\xi = (\mathcal{B}_{\gamma}, \zeta)$, our PDE-SSM can represent a diverse family of convolutional kernels. The examples show kernels that are (from left to right): localized, directionally blurred (anisotropic diffusion), shifted (convection), and a combination of effects. This flexibility allows our PDE-SSM model to learn a rich basis for spatial feature mixing, including non-local connections.
  • Figure 1: PDE-SSM Forward Pass
  • Figure 2: CIFAR-10 images: (a) real images; (b) DiT; (c) PDE-SSM-DiT. Visual quality is comparable, consistent with Table \ref{tab:cifar10}.
  • Figure 3: ImageNet$64$ training. (a) All methods converge at a similar rate and to similar FID scores. (b) The achieved FID score is consistent with the internal FID score.
  • Figure 4: LSUN-Churches generations: (a) real images; (b) DiT; (c) PDE-SSM-DiT.
  • ...and 3 more figures