DAMamba: Vision State Space Model with Dynamic Adaptive Scan

Tanzhe Li, Caoshuo Li, Jiayi Lyu, Hongjuan Pei, Baochang Zhang, Taisong Jin, Rongrong Ji

TL;DR

The paper tackles the challenge that vision state-space models (SSMs) have lagged behind CNNs and ViTs due to fixed, manually designed scanning patterns for 2D images. It introduces Dynamic Adaptive Scan (DAS), a data-driven mechanism that predicts patch coordinate offsets with an Offset Prediction Network and samples via bilinear interpolation to form input-dependent scan sequences, maintaining linear complexity $O(L)$. Built on DAS, the DAMamba backbone delivers strong results across ImageNet-1K, COCO, and ADE20K for classification, detection, and segmentation, outperforming prior vision SSMs and competing with leading CNNs and ViTs. The work demonstrates the power of adaptive scanning in vision SSMs, offering a flexible, efficient backbone that can serve diverse vision tasks and set new benchmarks for performance and efficiency.

Abstract

State space models (SSMs) have recently garnered significant attention in computer vision. However, due to the unique characteristics of image data, adapting SSMs from natural language processing to computer vision has not outperformed the state-of-the-art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily leverage manually designed scans to flatten image patches into sequences locally or globally. This approach disrupts the original semantic spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions. This enables more flexible modeling capabilities while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it surpasses some of the latest state-of-the-art CNNs and ViTs. Code will be available at https://github.com/ltzovo/DAMamba.

Paper Structure

This paper contains 16 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The trade-off between ImageNet-1K top-1 accuracy and inference throughput. All models are trained with the DeiT training hyperparameters. Inference throughput is measured on an NVIDIA RTX 3090 GPU with a batch size of 128. At matched inference throughput, the proposed DAMamba reaches higher accuracy than the compared SSMs, ViTs, and CNNs, and at matched accuracy it reaches higher throughput, indicating that DAMamba achieves state-of-the-art performance and efficiency.
  • Figure 2: Illustration of different scanning methods in vision state space models. As shown in (a), (b), and (c), previous methods such as Vim, VMamba, PlainMamba, and LocalMamba rely on manually designed global or local scanning methods. These fixed processing approaches lack flexibility and struggle to capture complex image structures. In (d), we propose a novel scanning method that adaptively allocates scanning order and regions through a data-driven approach. This not only achieves more flexible modeling capabilities but also maintains Mamba's linear computational complexity and global modeling capacity.
  • Figure 3: Illustration of the proposed Dynamic Adaptive Scan (DAS). For clarity, only four reference points are shown. Left: each initial reference point represents the original position of a patch, with its offsets learned by an Offset Prediction Network (OPN). Features of important regions are sampled at the predicted 2D coordinates using bilinear interpolation. Right: the detailed structure of the OPN. The query feature map is first transformed through a depthwise convolution to integrate local information. Then, another linear layer, after layer normalization and GELU activation, converts the feature map into offset values.
  • Figure 4: Left: The overall architecture of the proposed DAMamba; see the architecture table (tab:arch) for configurations. Right: Details of a DAMamba block.
  • Figure 5: Visualization of the Dynamic Adaptive Scan, where the blue pentagram represents the start of the scan and the blue circle represents the end of the scan.
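
The sampling step described in the Figure 3 caption (reference point plus predicted offset, then bilinear interpolation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes a single-channel feature map, takes the offsets as given rather than predicting them with an OPN, and clamps out-of-range coordinates to the map border, a choice the source does not specify. The names `bilinear_sample` and `dynamic_scan` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical simplification: one channel, H x W map stored as nested lists.
Grid = List[List[float]]


def bilinear_sample(feat: Grid, y: float, x: float) -> float:
    """Bilinearly interpolate feat at continuous (y, x).

    Assumption: coordinates outside the map are clamped to its border.
    """
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)                          # top-left neighbour
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # bottom-right neighbour
    wy, wx = y - y0, x - x0                          # fractional weights
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def dynamic_scan(feat: Grid,
                 offsets: List[List[Tuple[float, float]]]) -> List[float]:
    """Build a length-L sequence (L = H * W) for an SSM to consume.

    Each patch starts at its reference point (i, j) and is shifted by a
    per-patch offset (dy, dx). In the paper the offsets come from the OPN;
    here they are plain inputs.
    """
    h, w = len(feat), len(feat[0])
    seq = []
    for i in range(h):
        for j in range(w):
            dy, dx = offsets[i][j]
            seq.append(bilinear_sample(feat, i + dy, j + dx))
    return seq


if __name__ == "__main__":
    feat = [[0.0, 1.0], [2.0, 3.0]]
    shifted = [[(0.5, 0.5)] * 2 for _ in range(2)]
    print(dynamic_scan(feat, shifted))
```

Each output element reads at most four neighbouring values, so the cost of building the sequence grows linearly with the number of patches, consistent with the linear-complexity claim in the TL;DR. With all offsets set to zero the sketch reduces to an ordinary row-by-row flatten.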