2-D SSM: A General Spatial Layer for Visual Transformers
Ethan Baron, Itamar Zimerman, Lior Wolf
TL;DR
This work introduces a 2-D SSM layer, based on Roesser's multidimensional state-space model, that embeds 2-D inductive bias directly into Vision Transformers. The layer is parameter-efficient, numerically stable thanks to diagonalized A matrices and normalization, and expressive enough to model full-rank 2-D kernels; in some setups it outperforms standard positional encodings. It serves as a general, plug-in spatial layer for transformer backbones (ViT, Mega, Swin) across multiple datasets with minimal inference overhead. Theoretical analysis shows that the layer is more expressive spatially than prior 2-D SSMs such as S4ND, and empirical results on diverse vision benchmarks, supported by ablations, show improved accuracy.
Abstract
A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2-D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation invariance. To address these goals, we leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible number of additional parameters and negligible additional inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias. For example, vision transformers equipped with our layer exhibit effective performance even without positional encoding.
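To make the underlying recurrence concrete, the following is a minimal sketch of a generic Roesser-style 2-D state-space scan over an image grid. It is an illustration under stated assumptions, not the parameterization proposed here: the function name `roesser_scan`, the argument names, the single-channel input, and the zero boundary states are all hypothetical choices, and the naive double loop stands in for the accelerated computation mentioned above. The horizontal state is carried along each row and the vertical state along each column, and the two states are coupled through the off-diagonal matrices `A2` and `A3`.

```python
import numpy as np

def roesser_scan(u, A1, A2, A3, A4, B1, B2, C1, C2):
    """Naive Roesser-style 2-D recurrence over an H x W single-channel input.

    Illustrative only: names, shapes, and zero boundary states are assumptions.
      xh[i, j+1] = A1 xh[i, j] + A2 xv[i, j] + B1 u[i, j]
      xv[i+1, j] = A3 xh[i, j] + A4 xv[i, j] + B2 u[i, j]
      y[i, j]    = C1 xh[i, j] + C2 xv[i, j]
    """
    H, W = u.shape
    n = B1.shape[0]
    xh = np.zeros((H, W + 1, n))  # horizontal state, propagated along columns j
    xv = np.zeros((H + 1, W, n))  # vertical state, propagated along rows i
    y = np.zeros((H, W))
    # Row-major order guarantees xh[i, j] and xv[i, j] are ready when visited.
    for i in range(H):
        for j in range(W):
            h, v = xh[i, j], xv[i, j]
            y[i, j] = C1 @ h + C2 @ v
            xh[i, j + 1] = A1 @ h + A2 @ v + B1 * u[i, j]
            xv[i + 1, j] = A3 @ h + A4 @ v + B2 * u[i, j]
    return y
```

Feeding a unit impulse at the top-left position returns the 2-D kernel implied by the matrices, which is one way to inspect how the response spreads horizontally, vertically, and diagonally. Choosing diagonal `A1` through `A4` with entries of magnitude below one keeps this toy recurrence from blowing up; the stability and normalization scheme of the actual layer may differ.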
