MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

Zhongping Ji

TL;DR

The paper adapts State Space Models, which model long-range dependencies with linear complexity, to 2D vision tasks. It introduces a Multi-Head Scan (MHS) module and a Scan Route Attention (SRA) mechanism to construct 2D features from 1D selective scans, and substitutes them for the SS2D block in VM-UNet. On three medical-image segmentation datasets, the reported results show improved accuracy with substantially fewer parameters and FLOPs. The work points to a scalable pathway for visual backbones and motivates exploring richer scan patterns and hierarchical representations for broader tasks.

Abstract

Recently, State Space Models (SSMs), with Mamba as a prime example, have shown great promise for long-range dependency modeling with linear complexity. Vision Mamba and subsequent architectures followed, and they perform well on visual tasks. The crucial step in applying Mamba to visual tasks is to construct 2D visual features in a sequential manner. To effectively organize and construct visual features within the 2D image space through 1D selective scan, we propose a novel Multi-Head Scan (MHS) module. The embeddings extracted from the preceding layer are projected into multiple lower-dimensional subspaces. Within each subspace, the selective scan is then performed along distinct scan routes. The resulting sub-embeddings are integrated and projected back into the high-dimensional space. Moreover, we incorporate a Scan Route Attention (SRA) mechanism to enhance the module's capability to discern complex structures. To validate the efficacy of our module, we replace only the 2D-Selective-Scan (SS2D) block in VM-UNet with our proposed module, and we train our models from scratch without using any pre-trained weights. The results indicate a significant improvement in performance while reducing the parameter count of the original VM-UNet. The code for this study is publicly available at https://github.com/PixDeep/MHS-VM.
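
The data flow described in the abstract (split into subspaces, scan each subspace along its own route, merge the results) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: channel slicing stands in for the learned projections, `toy_scan` is a fixed linear recurrence rather than Mamba's input-dependent selective scan, the two routes (`row_major`, `col_major`) are hypothetical, and the SRA mechanism and the final output projection are omitted.

```python
def row_major(h, w):
    """Hypothetical route: visit patches row by row."""
    return [(r, c) for r in range(h) for c in range(w)]


def col_major(h, w):
    """Hypothetical route: visit patches column by column."""
    return [(r, c) for c in range(w) for r in range(h)]


def toy_scan(seq, decay=0.5):
    """Stand-in for selective scan (assumption): h_t = decay * h_{t-1} + x_t.

    `seq` is a list of per-patch vectors; the recurrence runs per channel.
    """
    state = [0.0] * len(seq[0])
    out = []
    for x in seq:
        state = [decay * s + v for s, v in zip(state, x)]
        out.append(state)
    return out


def multi_head_scan(feat, routes, decay=0.5):
    """Scan each channel group of `feat` (H x W x C nested lists) along its own route."""
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    n_heads = len(routes)
    assert c % n_heads == 0, "channels must split evenly across heads"
    d = c // n_heads
    out = [[[] for _ in range(w)] for _ in range(h)]
    for k, route_fn in enumerate(routes):
        route = route_fn(h, w)
        # Channel slicing plays the role of the projection into subspace k.
        seq = [feat[r][col][k * d:(k + 1) * d] for r, col in route]
        scanned = toy_scan(seq, decay)
        # Scatter each scanned vector back to its 2D position; heads are
        # concatenated in order, which stands in for the integration step.
        for (r, col), y in zip(route, scanned):
            out[r][col].extend(y)
    return out


feat = [[[1.0, 1.0], [1.0, 1.0]],
        [[1.0, 1.0], [1.0, 1.0]]]
result = multi_head_scan(feat, [row_major, col_major])
```

With this all-ones 2x2 input, the two heads accumulate context in different orders, so the off-diagonal patches end up with different values per head (`result[0][1]` is `[1.5, 1.75]`). That difference is the point of scanning each subspace along a distinct route.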

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: The architecture of our Multi-Head Scan (MHS) module. The illustrated module has three scan heads; this number can be adjusted to suit practical requirements. The design makes the MHS module a drop-in replacement for the SS2D module in the VSS block of VM-UNet.
  • Figure 2: Illustration of four scan patterns. From left to right: the traversal paths of image patches for the four scan patterns. The numbers marked along each dotted line indicate the traversal order of the patches along that path.
  • Figure 3: Illustration of image patch sequences. Each row displays a 1D sequence of the image patches spread out along a traversal path shown in Figure 2.
  • Figure 4: Illustration of four scan routes sampled in the third type of scan pattern.
  • Figure 5: Illustrations of two schemes for the ESF sub-module. (a) Mixture of Poolings; (b) CV-guided Scaling.
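
As Figures 2 and 3 suggest, a traversal path can be treated as an ordering of patch coordinates: flattening reads the patches in that order, and the same ordering places scanned values back on the grid. The sketch below is illustrative only. The `snake_route` traversal is a hypothetical example and is not claimed to be one of the paper's four scan patterns; `flatten_along` and `restore` are likewise names introduced here.

```python
def snake_route(h, w):
    """Hypothetical traversal: left-to-right on even rows, right-to-left on odd rows."""
    route = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        route.extend((r, c) for c in cols)
    return route


def flatten_along(grid, route):
    """Read the 2D grid of patches as a 1D sequence in route order."""
    return [grid[r][c] for r, c in route]


def restore(seq, route, h, w):
    """Place a 1D sequence back onto the 2D grid using the same route."""
    grid = [[None] * w for _ in range(h)]
    for (r, c), value in zip(route, seq):
        grid[r][c] = value
    return grid


patches = [[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]]
route = snake_route(3, 3)
sequence = flatten_along(patches, route)   # [0, 1, 2, 5, 4, 3, 6, 7, 8]
roundtrip = restore(sequence, route, 3, 3)
```

Because a route visits every patch exactly once, `restore` inverts `flatten_along`, so the output of a 1D scan can be returned to image layout without loss.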