2-D SSM: A General Spatial Layer for Visual Transformers

Ethan Baron, Itamar Zimerman, Lior Wolf

TL;DR

This work introduces a 2-D SSM layer based on Roesser's multidimensional state-space model to embed 2D inductive bias directly into Vision Transformers. The layer is parameter-efficient, numerically stable through diagonalized A matrices and normalization, and expressively rich enough to model full-rank 2D kernels, outperforming standard positional encodings in some setups. It serves as a general, plug-in spatial booster for ViT backbones (ViT, Mega, Swin) across multiple datasets with minimal inference overhead. Theoretical analysis shows the layer enhances spatial expressiveness beyond prior 2D SSMs like S4ND, and empirical results demonstrate improved accuracy with robust ablations on diverse vision benchmarks.

Abstract

A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2-D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation invariance. To address these goals, we leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible amount of additional parameters and inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias. For example, vision transformers equipped with our layer exhibit effective performance even without positional encoding.

Paper Structure

This paper contains 26 sections, 3 theorems, 33 equations, 6 figures, 3 tables.

Key Result

Theorem 4.1

The $8$ parameters of the 2-D SSM can express full-rank kernels.
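
The statement can be checked numerically on a small grid. The sketch below assumes the scalar Roesser-style recurrence $x^h_{i,j+1} = A_1 x^h_{i,j} + A_2 x^v_{i,j} + B_1 u_{i,j}$, $x^v_{i+1,j} = A_3 x^h_{i,j} + A_4 x^v_{i,j} + B_2 u_{i,j}$, $y_{i,j} = C_1 x^h_{i,j} + C_2 x^v_{i,j}$ with zero initial states. This form and its indexing convention are assumptions, since this summary does not reproduce the paper's equations. The function names `roesser_scan` and `impulse_kernel` are hypothetical. All eight parameters are set to 1 only so that the kernel entries are small integers that can be verified by hand; this is not a stable or recommended setting.

```python
import numpy as np

def roesser_scan(u, A1, A2, A3, A4, B1, B2, C1, C2):
    """Run an assumed scalar two-axis recurrence over a 2-D input u of shape (H, W)."""
    H, W = u.shape
    xh = np.zeros((H + 1, W + 1))  # horizontal state, advanced along j
    xv = np.zeros((H + 1, W + 1))  # vertical state, advanced along i
    y = np.zeros((H, W))
    for i in range(H):          # row-major order guarantees that both states
        for j in range(W):      # at (i, j) are final before they are read
            y[i, j] = C1 * xh[i, j] + C2 * xv[i, j]
            xh[i, j + 1] = A1 * xh[i, j] + A2 * xv[i, j] + B1 * u[i, j]
            xv[i + 1, j] = A3 * xh[i, j] + A4 * xv[i, j] + B2 * u[i, j]
    return y

def impulse_kernel(L, **params):
    """L x L kernel, obtained as the response to a unit impulse at (0, 0)."""
    u = np.zeros((L, L))
    u[0, 0] = 1.0
    return roesser_scan(u, **params)

ones = dict(A1=1, A2=1, A3=1, A4=1, B1=1, B2=1, C1=1, C2=1)
K_full = impulse_kernel(4, **ones)
print(K_full)
print("rank:", np.linalg.matrix_rank(K_full))            # 4, i.e. full rank

# Contrast: zeroing the cross terms A2 and A3 decouples the two axes.
K_decoupled = impulse_kernel(4, **{**ones, "A2": 0, "A3": 0})
print("rank:", np.linalg.matrix_rank(K_decoupled))       # 2
```

In this sketch the cross terms $A_2$ and $A_3$ are what let the two axes interact: with them the 4x4 kernel has rank 4, and without them only the first row and first column are non-zero.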

Figures (6)

  • Figure 1: (Left) The 2-D SSM layer is parameterized by A, B, C, and D. It is built on top of a two-axis linear recurrence and can be efficiently computed using 2-D convolution. (Center) Since the layer is based on a two-dimensional recurrence, it exhibits a strong bias toward positional awareness. The recurrence is unrestricted, allowing the layer to operate on 2-D sequences of any length. The values of $A_1, A_2, A_3$, and $A_4$ control the layer's focus, enabling it to capture short or long spatial dependencies in horizontal, vertical, or diagonal directions, as opposed to patch-based models. (Right) The layer can be easily integrated into ViT by applying it to the two-dimensional sequence of patches at the beginning of each transformer block.
  • Figure 2: Examples of paths from coordinate $(\hat{i},\hat{j})=(0,0)$ to $(i,j)=(4,4)$. Each path represents a sequence of recursive calls for Eq. \ref{eq:reqRule}.
  • Figure 3: The kernels before and after the modifications of Sec. \ref{par:relaxation}. Each column is created from the same parameters $A_1, \dots, A_4, B_1, B_2, C_1, C_2 \in \mathbb{R}$. The first row is the normalized 2-D SSM formulation of Eq. \ref{eq:reqRule}; the second row is the result of Eq. \ref{eq:reqRule_normalized} combined with Eq. \ref{eq:multi_by_2}, which is the kernel formulation we use. The bottom left corner of each heatmap is $K_{0,0}$. The figures demonstrate that before the relaxation the kernels displayed a diagonal tendency, while afterward they exhibited a more diverse and versatile pattern.
  • Figure 4: Accuracy on the CIFAR-10 grayscale classification task, which is part of the Long Range Arena.
  • Figure 5: ImageNet-1K accuracy of MEGA variants.
  • ...and 1 more figure
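
The Figure 1 caption says the layer can be computed with a 2-D convolution and applied to the two-dimensional sequence of patches. A minimal sketch of that idea follows; it is not the authors' implementation. It assumes patch tokens arrive in row-major order with no class token, one kernel per channel, and a causal (top-left anchored) convolution. The names `causal_conv2d_fft` and `spatial_mix_tokens` are hypothetical, and the kernel `k` is an arbitrary input array rather than one derived from SSM parameters.

```python
import numpy as np

def causal_conv2d_fft(x, k):
    """Causal 2-D convolution of x (H, W, D) with per-channel kernels k (H, W, D).

    Computes y[i, j] = sum over a <= i, b <= j of k[i - a, j - b] * x[a, b].
    """
    H, W, _ = x.shape
    # Zero-pad to (2H-1, 2W-1) so the circular convolution computed by the
    # FFT coincides with the linear convolution on the retained region.
    s = (2 * H - 1, 2 * W - 1)
    X = np.fft.rfft2(x, s=s, axes=(0, 1))
    K = np.fft.rfft2(k, s=s, axes=(0, 1))
    y = np.fft.irfft2(X * K, s=s, axes=(0, 1))
    return y[:H, :W]  # keep only outputs aligned with the input grid

def spatial_mix_tokens(tokens, k, grid):
    """Reshape (N, D) patch tokens to an (H, W, D) grid, mix spatially, flatten back."""
    H, W = grid
    x = tokens.reshape(H, W, -1)
    return causal_conv2d_fft(x, k).reshape(H * W, -1)

rng = np.random.default_rng(0)
H, W, D = 3, 4, 2
tokens = rng.standard_normal((H * W, D))   # stand-in for patch embeddings
k = rng.standard_normal((H, W, D))         # stand-in for a learned kernel
mixed = spatial_mix_tokens(tokens, k, (H, W))
print(mixed.shape)  # (12, 2)
```

The FFT route costs O(HW log(HW)) per channel instead of the O((HW)^2) of a direct sum, which is one plausible reading of "efficiently computed" in the caption. In a transformer block, the output would then be passed on to the attention sub-layer; how it is combined with the input (for example through a residual connection) is not specified in this summary.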

Theorems & Definitions (4)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem C.1
  • proof