Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks

Enis Baty, Alejandro Hernández Díaz, Chris Bridges, Rebecca Davidson, Steve Eckersley, Simon Hadfield

TL;DR

This paper introduces Mamba2D, a native two-dimensional State-Space Model (SSM) for vision that overcomes the 1D and NLP-centric biases of prior SSMs. By employing dual axis parameterizations and a 2D wavefront scan, Mamba2D preserves spatial coherence across both image dimensions and enables efficient parallel computation, avoiding the limitations of flattening and 1D scans. The architecture integrates Mamba2D as a token mixer within a MetaFormer-style backbone, using Mamba2D in early stages and attention later, achieving competitive ImageNet-1K performance with a compact parameter count. The work provides a public implementation and suggests promising directions for scalable, long-range spatial modeling in vision tasks.

Abstract

State-Space Models (SSMs) have recently emerged as a powerful and efficient alternative to the long-standing transformer architecture. However, existing SSM conceptualizations retain deep biases inherited from their origins in natural language processing. This constrains their ability to appropriately model the spatially-dependent characteristics of visual inputs. In this paper, we address these limitations by re-deriving modern selective state-space techniques, starting from a natively multidimensional formulation. Prior works attempt to apply natively 1D SSMs to 2D data (i.e., images) by relying on arbitrary combinations of 1D scan directions to capture spatial dependencies. In contrast, Mamba2D uses a single 2D scan direction that factors in both dimensions of the input natively, effectively modelling spatial dependencies when constructing hidden states. Mamba2D shows comparable performance to prior adaptations of SSMs for vision tasks on standard image classification evaluations with the ImageNet-1K dataset. Source code is available at https://github.com/cocoalex00/Mamba2D.

Paper Structure

This paper contains 14 sections, 12 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Illustration of a typical 1D Mamba scan (left) and our Mamba2D wavefront scan (right). Our reformulation of the S6/Mamba scan retains the spatial coherency between adjacent pixels or tokens in 2D. Dashed purple lines indicate each diagonal wavefront at which hidden states are computed in parallel.
  • Figure 2: Architecture of Mamba2D Block. Two parallel branches implement our Mamba2D SSM alongside the local processing path, followed by an FFN in line with traditional transformer-style blocks.
  • Figure 3: Architecture of our Mamba2D Model. A convolutional stem performs an initial patch embedding of the input image, followed by 4 stages of further feature extraction. Each stage consists of $N_{1 \ldots 4}$ blocks, each containing a token mixer followed by an FFN. As shown, we opt to use Mamba2D as the token mixer for the first two stages, where spatial relations are more impactful. The final two stages use vanilla attention for a lossless encoding of the channel-wise relations, as the spatial size of the features is diminished by that point.
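
To make the wavefront idea in Figure 1 concrete, the sketch below runs a toy 2D linear recurrence along anti-diagonals. It is an illustration only, not the authors' implementation: the recurrence h[i, j] = a_row * h[i, j-1] + a_col * h[i-1, j] + b * x[i, j], the scalar per-pixel coefficients, the zero boundary condition, and the function names `wavefront_scan` and `sequential_scan` are all assumptions chosen for brevity, and the selective (input-dependent) parameterization of the paper is not reproduced. The point it shows is that every cell on a diagonal i + j = d depends only on cells from diagonal d - 1, so each diagonal can be updated in one vectorized step.

```python
import numpy as np


def wavefront_scan(x, a_row, a_col, b):
    """Toy 2D recurrence evaluated one anti-diagonal at a time.

    Assumed recurrence (hypothetical, for illustration):
        h[i, j] = a_row[i, j] * h[i, j-1] + a_col[i, j] * h[i-1, j] + b[i, j] * x[i, j]
    with h treated as zero outside the image.
    """
    H, W = x.shape
    # Pad one row and one column of zeros so boundary cells need no special case.
    h = np.zeros((H + 1, W + 1), dtype=np.result_type(x, a_row, a_col, b, float))
    for d in range(H + W - 1):
        # All (i, j) with i + j == d that lie inside the image.
        i = np.arange(max(0, d - W + 1), min(H, d + 1))
        j = d - i
        # Both neighbours live on diagonal d - 1, which is already final,
        # so the whole diagonal is updated in a single vectorized step.
        h[i + 1, j + 1] = (
            a_row[i, j] * h[i + 1, j]    # left neighbour  h[i, j-1]
            + a_col[i, j] * h[i, j + 1]  # top neighbour   h[i-1, j]
            + b[i, j] * x[i, j]
        )
    return h[1:, 1:]


def sequential_scan(x, a_row, a_col, b):
    """Reference: the same recurrence, visited pixel by pixel in raster order."""
    H, W = x.shape
    h = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            left = h[i, j - 1] if j > 0 else 0.0
            top = h[i - 1, j] if i > 0 else 0.0
            h[i, j] = a_row[i, j] * left + a_col[i, j] * top + b[i, j] * x[i, j]
    return h


# Usage: both orderings give the same hidden states on random inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 7))
a_row = rng.uniform(0.0, 0.5, size=(5, 7))
a_col = rng.uniform(0.0, 0.5, size=(5, 7))
b = rng.normal(size=(5, 7))
assert np.allclose(wavefront_scan(x, a_row, a_col, b),
                   sequential_scan(x, a_row, a_col, b))
```

For an H x W input, the wavefront ordering needs H + W - 1 sequential steps rather than H * W, which is where the parallelism along each dashed diagonal in Figure 1 comes from.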