DefMamba: Deformable Visual State Space Model

Leiye Liu, Miao Zhang, Jihao Yin, Tingwei Liu, Wei Ji, Yongri Piao, Huchuan Lu

TL;DR

DefMamba introduces a deformable scanning strategy and a deformable state-space model to enable structure-aware feature extraction in vision SSM backbones. By learning deformable reference points and a content-adaptive token order, it preserves spatial information and prioritizes informative regions, improving efficiency and accuracy across ImageNet classification, COCO detection/segmentation, and ADE20K segmentation. Extensive experiments and ablations demonstrate that DefMamba outperforms prior SSM-based methods and remains competitive with CNN and Transformer baselines while reducing computational burden in several settings. This work advances visual foundation models by integrating deformable mechanisms with state-space dynamics to align feature processing with object structure and detail changes in diverse visual tasks.

Abstract

Recently, state space models (SSMs), particularly Mamba, have attracted significant attention from scholars due to their ability to effectively balance computational efficiency and performance. However, most existing visual Mamba methods flatten images into 1D sequences using predefined scan orders, which leaves the model less capable of utilizing the spatial structural information of the image during feature extraction. To address this issue, we propose a novel visual foundation model called DefMamba. This model includes a multi-scale backbone structure and deformable Mamba (DM) blocks, which dynamically adjust the scanning path to prioritize important information, thus enhancing the capture and processing of relevant input features. By incorporating a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is open source on DefMamba.
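
The contrast the abstract draws between a predefined scan order and a content-dependent one can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, and the `index_offsets` array is a hand-written placeholder standing in for whatever an offset network would predict. It only shows the general idea of shifting each token's position in the sequence and re-sorting to obtain a new scan order.

```python
import numpy as np

def raster_order(h, w):
    """Fixed scan: token indices 0..h*w-1 in row-major order."""
    return np.arange(h * w)

def deformable_order(h, w, index_offsets):
    """Content-dependent scan: shift each token's index, then sort.

    index_offsets is a hypothetical stand-in for the output of an
    offset network; it is not taken from the paper.
    """
    shifted = np.arange(h * w, dtype=np.float64) + index_offsets
    # A stable sort keeps raster order among tokens with equal shifted indices.
    return np.argsort(shifted, kind="stable")

def scan(features, order):
    """Flatten an (h, w, c) feature map to (h*w, c) and reorder its tokens."""
    h, w, c = features.shape
    return features.reshape(h * w, c)[order]

h, w, c = 3, 3, 2
features = np.arange(h * w * c, dtype=np.float64).reshape(h, w, c)

# Placeholder offsets: push the first token later in the scan.
offsets = np.zeros(h * w)
offsets[0] = 4.5

order = deformable_order(h, w, offsets)
tokens = scan(features, order)  # sequence that would be fed to the SSM
```

With zero offsets, `deformable_order` reduces to the raster scan, so the fixed order is the special case in which no deformation is applied.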

Paper Structure

This paper contains 18 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison of deformable scanning and other scanning methods. (a) Raster scanning (Vim, VMamba), (b) Local scanning (LocalMamba), (c) Continuous scanning (PlainMamba), (d) Our designed deformable scanning. The blue dots represent the reference points, and the red dots represent the deformable points. The yellow arrows represent the fixed scan order, and the red gradient arrows represent the deformable scan order. Our method captures the structural characteristics of objects more accurately, which yields a more refined scanning order.
  • Figure 2: Overview of DefMamba. (a) depicts the overall architecture of our network. (b) illustrates the structure of the deformable Mamba block. LN means LayerNorm and FFN is a feed-forward network.
  • Figure 3: Illustration of the Deformable State Space Model. (a) illustrates the processing flow of the deformable state space model for feature extraction. (b) depicts the processing flow of the deformable scan. The upper part shifts the feature points so that the model can focus on more salient regions, while the lower part shifts the token positions to find a scanning order better suited to the current input. For clarity, only nine points are depicted in the figure; the actual processing involves a greater number of points. (c) presents the detailed structure of the offset network.
  • Figure 4: Visualization of activation maps at specific positions. The positions are marked by red and orange points. RS stands for raster scanning, and DS stands for our deformable scanning.
  • Figure 5: Visualization of deformable points and deformable token index. In (a), the orange dots represent deformable points, the green dots represent reference points, and the red arrows represent the offset path of the points. In (b) and (c), the gradient from yellow to green represents the scanning path, with the yellow dots being scanned first and the green dots being scanned later.
  • ...and 1 more figures