VSSD: Vision Mamba with Non-Causal State Space Duality

Yuheng Shi, Minjing Dong, Mingjia Li, Chang Xu

TL;DR

This work addresses the limitation that causal processing imposes on State Space Models in vision by introducing Non-Causal State Space Duality (NC-SSD) and the Visual State Space Duality (VSSD) backbone. By treating the state transition as a scalar and deriving a single global hidden state from bidirectional scanning, NC-SSD removes the causal mask while preserving a global receptive field and linear complexity. Building on this, VSSD combines NC-SSD blocks with FFNs, local perception units, and self-attention in a hierarchical backbone, achieving state-of-the-art results among SSM-based vision models on ImageNet-1K, COCO, and ADE20K while also training substantially faster. Ablations highlight the pivotal role of the weighting vector $\mathbf{m}$ and show that the approach preserves 2D patch relationships without resorting to extensive multi-scan strategies. Overall, NC-SSD/VSSD offers a strong, efficient alternative to ViTs and CNNs across a range of vision tasks, with public code to support reproducibility.
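As a rough sketch of why bidirectional scanning yields one global hidden state, the recurrences can be written per channel as follows. The notation is an assumption made for illustration: $x_t$ is a scalar input, $B_t$ and $C_t$ are state-sized vectors, $A_t$ is the scalar transition, and $m_t$ is a per-token weight derived from $A_t$ (taken here to be $m_t = 1/A_t$; the paper's equations should be consulted for the exact definition).

$$
\text{Causal SSD:}\qquad h_t = A_t\,h_{t-1} + B_t x_t, \qquad y_t = C_t^{\top} h_t
$$

$$
\text{Magnitude discarded:}\qquad h_t = h_{t-1} + m_t\,B_t x_t = \sum_{j \le t} m_j\,B_j x_j
$$

$$
\text{Forward + backward scan:}\qquad H_i^{\rightarrow} + H_i^{\leftarrow} = \sum_{j \le i} m_j\,B_j x_j + \sum_{j \ge i} m_j\,B_j x_j = \sum_{j=1}^{L} m_j\,B_j x_j + m_i\,B_i x_i
$$

$$
\text{Shared state (self term dropped):}\qquad H = \sum_{j=1}^{L} m_j\,B_j x_j, \qquad y_i = C_i^{\top} H
$$

Under these assumptions, every token reads from the same $H$, so no causal mask is needed and the cost stays linear in the sequence length $L$.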

Abstract

Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and a global receptive field. However, their high computational demands limit their applicability in processing long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherent causal nature of SSD/SSMs restricts their application to non-causal vision tasks. To address this limitation, we introduce the Visual State Space Duality (VSSD) model, which has a non-causal format of SSD. Specifically, we propose to discard the magnitude of interactions between the hidden state and tokens while preserving their relative weights, which relieves the dependence of each token's contribution on previous tokens. Together with multi-scan strategies, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models. Code and weights are available at https://github.com/YuHengsss/VSSD.
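The difference between the causal and non-causal formats can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `ssd_causal` and `nc_ssd`, the list-based tensor layout, and the choice `m = 1 / a` for the token weights are all assumptions made for illustration. The causal version updates a hidden state token by token, while the non-causal version builds one global hidden state that every token reads.

```python
def ssd_causal(x, a, B, C):
    """Causal scan with a scalar transition a[t] per token (illustrative).

    x: L tokens of D channels, a: L scalars, B and C: L vectors of size N.
    Output t depends only on tokens 0..t.
    """
    L, D, N = len(x), len(x[0]), len(B[0])
    h = [[0.0] * D for _ in range(N)]  # hidden state, shape N x D
    ys = []
    for t in range(L):
        # Decay the previous state by a[t], then add the current token.
        h = [[a[t] * h[n][d] + B[t][n] * x[t][d] for d in range(D)]
             for n in range(N)]
        ys.append([sum(C[t][n] * h[n][d] for n in range(N)) for d in range(D)])
    return ys


def nc_ssd(x, m, B, C):
    """Non-causal sketch: one global hidden state shared by all tokens.

    m: L scalar weights, one per token (assumed here to be derived from a).
    """
    L, D, N = len(x), len(x[0]), len(B[0])
    # Global state: weighted sum over all tokens, independent of scan order.
    H = [[sum(m[t] * B[t][n] * x[t][d] for t in range(L)) for d in range(D)]
         for n in range(N)]
    return [[sum(C[t][n] * H[n][d] for n in range(N)) for d in range(D)]
            for t in range(L)]


if __name__ == "__main__":
    x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
    a = [0.9, 0.8, 0.5]
    B = [[1.0], [2.0], [-1.0]]
    C = [[1.0], [0.5], [2.0]]
    m = [1.0 / v for v in a]  # assumption: weights taken as 1 / a
    print(ssd_causal(x, a, B, C))
    print(nc_ssd(x, m, B, C))
```

In this sketch, changing the last token leaves the first output of `ssd_causal` untouched but changes the first output of `nc_ssd`, and both functions cost time linear in the number of tokens.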

Paper Structure (14 sections, 10 equations, 5 figures, 7 tables)


Figures (5)

  • Figure 1: (a) Two challenges when applying SSM/SSD to image data. (b) and (c) are comparisons on ImageNet. Our VSSD model achieves leading accuracy and efficiency compared to CNN-based ConvNeXt, ViT-based Swin Transformer, and SSM-based VMamba. The latency of all models is measured on an A100 GPU using a batch size of 128 and FP16 precision.
  • Figure 2: Illustration of the hidden state generation process for SSD and NC-SSD. During the hidden state update, NC-SSD uses the scalar $A$ to determine how much information the current token adds, in contrast to SSD, where $A$ dictates the proportion of the hidden state to be retained. Unlike SSD, which generates token-wise hidden states, NC-SSD produces only a global hidden state to accommodate non-causal image data.
  • Figure 3: Visualization of input images alongside their corresponding heat maps, which are derived by averaging the vector $\mathbf{m}$ across the heads in the NC-SSD.
  • Figure 4: Overall architecture of the proposed VSSD model. The VSSD model begins with a series of overlapping convolutions serving as the stem, followed by four progressive stages. The first three stages use the VSSD Block, detailed in the lower part of the figure, which comprises an NC-SSD block and an FFN. Local Perception Units (LPU) are omitted in this visualization for brevity.
  • Figure 5: Comparison of the Effective Receptive Field (ERF) among our VSSD, CNN-based models (ResNet and ConvNeXt), attention-based models (Swin and DeiT), and the SSM-based VMamba. Compared to VMamba, our VSSD effectively eliminates the impact of token spacing on each token's information contribution.