PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

Chenhongyi Yang, Zehui Chen, Miguel Espinosa, Linus Ericsson, Zhenyu Wang, Jiaming Liu, Elliot J. Crowley

TL;DR

PlainMamba presents a plain, non-hierarchical state-space model for visual recognition that extends Mamba to 2D images with continuous 2D scanning and direction-aware updating, while avoiding CLS tokens and maintaining constant width for scalable deployment. By integrating a convolutional tokenizer, identical stacked PlainMamba blocks, and a simple head, it achieves competitive results across ImageNet-1K, ADE20K, and COCO, particularly excelling in high-resolution settings with lower compute than hierarchical counterparts. Key innovations are Continuous 2D Scanning, which preserves spatial adjacency during token scanning, and Direction-Aware Updating, which injects 2D positional cues into the selective scan. Empirical results show PlainMamba outperforming previous non-hierarchical SSMs and approaching hierarchical models, offering a strong, simple baseline for future vision systems and downstream applications.

Abstract

We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data, and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating, which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks, where it achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs in particular, PlainMamba requires much less computation while maintaining high performance. Code and models are available at: https://github.com/ChenhongyiYang/PlainMamba.

Paper Structure (14 sections, 5 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: While hierarchical visual encoders may demonstrate superior accuracy on open-source visual recognition benchmarks, the plain non-hierarchical models have had more widespread use because of their simple structure. We investigate the potential of the plain Mamba model in visual recognition.
  • Figure 2: (a) The overall architecture of the proposed PlainMamba. PlainMamba does not have a hierarchical structure; instead, it stacks $L$ identical PlainMamba blocks to form the main network. For image classification, it uses global average pooling instead of a CLS token to gather global information. (b) Architecture of the PlainMamba block, which is similar to the Mamba block, where the selective scanning is combined with a gated MLP. (c) The proposed Direction-Aware Updating, where a series of learnable parameters $\mathbf{\Theta}_k$ are combined with the data-dependent updating parameters to explicitly inject relative 2D positional information into the selective scanning process.
  • Figure 3: Comparison between our Continuous 2D Scanning and the selective scan orders in ViM [zhu2024vision] and VMamba [liu2024vmamba]. Our method makes sure that every scanned visual token is spatially adjacent to its predecessor, avoiding potential spatial and semantic discontinuity.
  • Figure 4: Efficiency comparison between PlainMamba and DeiT. We modify the DeiT-Tiny model by changing its channel number to 224, resulting in a similar-size model (7.4M) to PlainMamba-L1. The peak memory is measured using a batch size of 1.
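
The adjacency property stated in the Figure 3 caption (every scanned token is spatially adjacent to its predecessor) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single row-wise serpentine order over an H x W token grid, and the function names `snake_scan_order`, `raster_scan_order`, and `is_continuous` are hypothetical. It only shows why a serpentine order keeps consecutive tokens adjacent while a plain row-major (raster) order does not.

```python
def snake_scan_order(height, width):
    """Return (row, col) positions in a row-wise serpentine order.

    Even rows run left to right and odd rows run right to left, so the
    last token of one row sits directly above the first token of the next.
    This is an illustrative assumption, not the paper's exact scan.
    """
    order = []
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def raster_scan_order(height, width):
    """Return (row, col) positions in plain row-major order."""
    return [(r, c) for r in range(height) for c in range(width)]


def is_continuous(order):
    """True if every position is a 4-neighbour of its predecessor."""
    return all(
        abs(r1 - r0) + abs(c1 - c0) == 1
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


if __name__ == "__main__":
    print(snake_scan_order(2, 3))
    print(is_continuous(snake_scan_order(4, 4)))
    print(is_continuous(raster_scan_order(4, 4)))
```

In the raster order, the jump from the end of one row to the start of the next separates consecutive tokens by a full row width, which is the kind of spatial discontinuity the caption refers to.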