Dynamic Vision Mamba

Mengxuan Wu, Zekai Li, Zhiyuan Liang, Moyang Li, Xuanlei Zhao, Samir Khaki, Zheng Zhu, Xiaojiang Peng, Konstantinos N. Plataniotis, Kai Wang, Wangbo Zhao, Yang You

TL;DR

Dynamic Vision Mamba (DyVM) addresses token- and block-level redundancy in Vision Mamba architectures with two mechanisms. The first is multi-stage, learnable token pruning with a rearrangement strategy that keeps training and inference consistent. The second is a per-layer dynamic block selector that chooses SSM blocks adaptively for each input. Training uses a joint objective that combines classification, pruning-ratio supervision, block-ratio supervision, and distillation losses to calibrate pruning while maintaining accuracy. On ImageNet-1K, DyVM reduces FLOPs by 35.2% with a 1.7% accuracy drop on Vim-S. It also generalizes to VideoMamba and MambaReg, and to semantic segmentation on ADE20K. Overall, DyVM offers a practical route to more efficient Mamba-based vision models with only a minor loss in performance.
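
The joint objective described above can be sketched as a weighted sum of its four named components. This is a minimal illustration, not the authors' formulation: the summary names only the components, so the squared-error form of the ratio terms, the KL form of the distillation term, the unit weights, and every function and parameter name below are assumptions.

```python
import math

def cross_entropy(probs, label):
    """Classification term: negative log-probability of the true class."""
    return -math.log(probs[label])

def ratio_loss(keep_decisions, target_ratio):
    """Assumed form: squared gap between the observed keep fraction and a target."""
    observed = sum(keep_decisions) / len(keep_decisions)
    return (observed - target_ratio) ** 2

def kl_divergence(student, teacher):
    """Assumed distillation term: KL(teacher || student) over class probabilities."""
    return sum(t * math.log(t / s) for s, t in zip(student, teacher) if t > 0)

def joint_loss(student_probs, teacher_probs, label,
               token_keep, token_target, block_keep, block_target,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four components; the weights are hypothetical."""
    w_cls, w_tok, w_blk, w_dis = weights
    return (w_cls * cross_entropy(student_probs, label)
            + w_tok * ratio_loss(token_keep, token_target)
            + w_blk * ratio_loss(block_keep, block_target)
            + w_dis * kl_divergence(student_probs, teacher_probs))

# When both ratios hit their targets and the student matches the teacher,
# only the classification term remains.
loss = joint_loss(
    student_probs=[0.5, 0.5], teacher_probs=[0.5, 0.5], label=0,
    token_keep=[1, 0, 1, 0], token_target=0.5,
    block_keep=[1, 1, 0, 0], block_target=0.5,
)
```

In this sketch the two ratio terms push the fraction of retained tokens and selected blocks toward their targets, while the distillation term keeps the pruned model's predictions close to those of an unpruned teacher.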

Abstract

Mamba-based vision models have gained extensive attention because they are computationally more efficient than attention-based models. However, spatial redundancy still exists in these models in the form of token and block redundancy. For token redundancy, we analytically find that earlier token pruning methods either cause inconsistency between training and inference or introduce extra computation at inference. Therefore, we customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically, based on the empirical observation that the inference speed of Mamba-based vision models is largely determined by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. We achieve a 35.2% reduction in FLOPs with an accuracy loss of only 1.7% on Vim-S. DyVM also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
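
The training-inference inconsistency can be illustrated with a toy example. The sketch below is not the paper's implementation. It assumes a scalar linear recurrence `h_t = a * h_{t-1} + b * x_t` as a stand-in for an SSM block, and it assumes that rearrangement moves retained tokens to the front of the sequence (the abstract does not state the exact ordering). The helper names `scan` and `rearrange` are hypothetical.

```python
def scan(xs, a=0.5, b=1.0):
    """Toy recurrence h_t = a * h_{t-1} + b * x_t; returns every hidden state."""
    h, states = 0.0, []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states

def rearrange(tokens, keep):
    """Assumed strategy: retained tokens first (order preserved), pruned tokens last."""
    kept = [t for t, k in zip(tokens, keep) if k]
    pruned = [t for t, k in zip(tokens, keep) if not k]
    return kept + pruned, len(kept)

tokens = [1.0, 2.0, 3.0]
keep = [True, False, True]  # the middle token is pruned

# Inference: pruned tokens are dropped before the scan.
inference = scan([t for t, k in zip(tokens, keep) if k])

# Training with in-place masking: the pruned input is zeroed, but the hidden
# state still decays for one extra step at the masked position.
masked = scan([t if k else 0.0 for t, k in zip(tokens, keep)])
masked_retained = [h for h, k in zip(masked, keep) if k]

# Training with rearrangement: retained tokens form a contiguous prefix, so
# their hidden states match inference exactly.
sequence, n_kept = rearrange(tokens, keep)
rearranged_retained = scan(sequence)[:n_kept]

print(inference)            # [1.0, 3.5]
print(masked_retained)      # [1.0, 3.25]
print(rearranged_retained)  # [1.0, 3.5]
```

With in-place masking, the last token sees a hidden state that has decayed through the masked position, so its output differs from inference. After rearrangement the retained tokens are processed exactly as they would be once the pruned token is dropped.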

Paper Structure

This paper contains 46 sections, 23 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: (a) Pixel-wise attention score statistics computed from 1,000 images, with 10 images randomly sampled per class across 100 classes of the ImageNet-1K dataset. (b) FLOPs and throughput of Vim with different numbers of SSM blocks.
  • Figure 2: Training (left) and inference (right) behavior of three token pruning methods, illustrated with a 3-token sequence in which the middle token is pruned. Solid fill indicates retained tokens, unfilled elements represent tokens masked during training, and diagonal line patterns denote tokens dropped during inference.
  • Figure 3: Dynamic Vision Mamba pipeline. Predictor modules are inserted between specific Mamba blocks to gradually prune redundant tokens, while block selection modules are embedded in every Mamba block to dynamically select SSM blocks for each sample. Together, these two methods greatly reduce the FLOPs of the model.
  • Figure 4: The trade-off between FLOPs and accuracy represents a balance between model efficiency and performance. Larger models are less impacted by pruning, indicating they have greater spatial redundancy and can tolerate more aggressive pruning.
  • Figure 5: Visualization of token pruning results. In each group of images, we show the original image, along with its hidden attention and retained tokens of each pruning stage. Pruned tokens are mostly from low-attention areas, implying their redundancy.
  • ...and 2 more figures