
GroupMamba: Efficient Group-Based Visual State Space Model

Abdelrahman Shaker, Syed Talal Wasim, Salman Khan, Juergen Gall, Fahad Shahbaz Khan

TL;DR

GroupMamba introduces a parameter-efficient Modulated Group Mamba layer that partitions channels into four groups, assigns a directional Visual Single Selective Scan block to each group, and applies a Channel Affinity Modulation mechanism to enable cross-group communication. A distillation-based training objective stabilizes large Mamba-based models, yielding robust performance across image classification, object detection/segmentation, and semantic segmentation. Empirical results show state-of-the-art or competitive accuracy with substantially fewer parameters than prior SSMs on ImageNet-1K, MS-COCO, and ADE20K, including 83.3% top-1 accuracy on ImageNet-1K with 23M parameters. The work demonstrates that multi-directional grouping, channel-wise modulation, and knowledge distillation together enable scalable, stable, and efficient vision SSMs with strong practical impact for CV foundations and downstream tasks.

Abstract

State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity, making them attractive for various applications. However, purely SSM-based models face critical challenges related to stability and achieving state-of-the-art performance in computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. We introduce a parameter-efficient Modulated Group Mamba layer that divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks into a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection and instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% more efficient in terms of parameters than the best existing Mamba design of the same model size. Code and models are available at: https://github.com/Amshaker/GroupMamba.
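
The data flow described in the abstract (split channels into four groups, scan each group along one spatial direction, then modulate channels so the groups can interact) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the four scan orders, the fixed-decay recurrence `toy_scan` (a stand-in for the selective scan inside a VSSS block), and the gate in `channel_gate` (a stand-in for the channel modulation operator) are all assumptions chosen only to make the structure concrete.

```python
import math

# Assumed scan orders; the source only says "four spatial directions".
DIRECTIONS = ("left_to_right", "right_to_left", "top_to_bottom", "bottom_to_top")


def to_sequence(grid, direction):
    """Flatten an H x W grid (list of rows) into a 1D list in the given scan order."""
    if direction in ("left_to_right", "right_to_left"):
        seq = [v for row in grid for v in row]        # row-major order
    else:
        seq = [v for col in zip(*grid) for v in col]  # column-major order
    if direction in ("right_to_left", "bottom_to_top"):
        seq.reverse()
    return seq


def to_grid(seq, direction, height, width):
    """Inverse of to_sequence: put scanned values back at their grid positions."""
    seq = list(seq)
    if direction in ("right_to_left", "bottom_to_top"):
        seq.reverse()
    if direction in ("left_to_right", "right_to_left"):
        return [seq[r * width:(r + 1) * width] for r in range(height)]
    cols = [seq[c * height:(c + 1) * height] for c in range(width)]
    return [list(row) for row in zip(*cols)]


def toy_scan(seq, decay=0.5):
    """Stand-in recurrence h_t = decay * h_{t-1} + x_t (NOT Mamba's selective scan)."""
    out, state = [], 0.0
    for x in seq:
        state = decay * state + x
        out.append(state)
    return out


def grouped_directional_scan(x):
    """x: list of C channels, each an H x W grid; C must be divisible by 4."""
    channels = len(x)
    assert channels % 4 == 0, "channel count must split evenly into four groups"
    height, width = len(x[0]), len(x[0][0])
    per_group = channels // 4
    out = []
    for c, grid in enumerate(x):
        direction = DIRECTIONS[c // per_group]        # one direction per group
        scanned = toy_scan(to_sequence(grid, direction))
        out.append(to_grid(scanned, direction, height, width))
    return out


def channel_gate(x):
    """Hypothetical channel modulation: scale each channel by a sigmoid gate
    computed relative to a statistic pooled over ALL channels, so that
    channels from different groups influence one another."""
    means = [sum(map(sum, grid)) / (len(grid) * len(grid[0])) for grid in x]
    global_mean = sum(means) / len(means)
    gates = [1.0 / (1.0 + math.exp(-(m - global_mean))) for m in means]
    return [[[g * v for v in row] for row in grid] for g, grid in zip(gates, x)]


# Usage: four channels (one per group) on a 2 x 2 feature map.
x = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]
y = channel_gate(grouped_directional_scan(x))
```

Because each channel is scanned in only one direction, the scan cost per channel is a single pass over the H x W tokens; the gate is what lets information cross group boundaries in this sketch.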

Paper Structure (22 sections, 10 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: Comparison in terms of Parameters vs. Top-1 Accuracy on ImageNet-1K [deng2009imagenet]. Our GroupMamba-B achieves superior top-1 classification accuracy while reducing parameters by 36% compared to VMamba [yue2024vmamba].
  • Figure 2: Overview of the proposed method. Top Row: The overall architecture of our framework with a consistent hierarchical design comprising four stages. Bottom Row: (b) The design of the Modulated Group Mamba layer. The input channels are divided into four groups, with a single scanning direction for each VSSS block. This significantly reduces the computational complexity compared to the standard Mamba layer, with similar performance. A Channel Affinity Modulation mechanism is introduced to address the limited interactions within the VSSS blocks. (c) The design of the VSSS block. It consists of a Mamba block with a 1D Selective Scanning block, followed by an FFN. (d) The four scanning directions used for the four VSSS blocks.
  • Figure 3: Qualitative results of GroupMamba-T for object detection and instance segmentation (first row) on the MS-COCO val. set and semantic segmentation (second row) on the ADE20K val. set.
  • Figure 4: Comparison of GroupMamba variants and SSM-based methods in top-1 accuracy on ImageNet-1K [deng2009imagenet] and computational efficiency in terms of throughput and number of parameters. The throughput (number of predicted samples per second) is measured using a single NVIDIA A100 GPU with a batch size of 128 for all methods.
  • Figure 5: Training loss visualization for GroupMamba-S with and without the proposed distillation loss.
  • ...and 2 more figures