Dynamic Vision Mamba
Mengxuan Wu, Zekai Li, Zhiyuan Liang, Moyang Li, Xuanlei Zhao, Samir Khaki, Zheng Zhu, Xiaojiang Peng, Konstantinos N. Plataniotis, Kai Wang, Wangbo Zhao, Yang You
TL;DR
Dynamic Vision Mamba (DyVM) addresses token- and block-level redundancy in Vision Mamba architectures with two mechanisms: multi-stage, learnable token pruning with a rearrangement strategy that preserves training-inference consistency, and a per-layer dynamic block selector that adaptively chooses SSM blocks for each input. A joint training objective combines classification, pruning-ratio supervision, block-ratio supervision, and distillation losses to calibrate pruning and maintain accuracy. On ImageNet-1K with Vim variants, DyVM reduces FLOPs by up to 35.2% at a 1.7% accuracy drop, and it generalizes to VideoMamba and MambaReg as well as to semantic segmentation on ADE20K. Overall, DyVM offers a practical, scalable path to more efficient Mamba-based vision systems that preserves performance and provides insights for future architecture design.
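One way to read the joint objective described above is as a weighted sum of its four terms. The weighted-sum form, the weights $\lambda$, and the subscript names are assumptions made for illustration only; the summary names the four components but does not give their exact formulation.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{cls}}
  + \lambda_{\text{token}}   \, \mathcal{L}_{\text{token}}
  + \lambda_{\text{block}}   \, \mathcal{L}_{\text{block}}
  + \lambda_{\text{distill}} \, \mathcal{L}_{\text{distill}}
```

Here $\mathcal{L}_{\text{cls}}$ stands for the classification loss, $\mathcal{L}_{\text{token}}$ and $\mathcal{L}_{\text{block}}$ for the pruning-ratio and block-ratio supervision terms, and $\mathcal{L}_{\text{distill}}$ for the distillation loss.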
Abstract
Mamba-based vision models have gained extensive attention because they are computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, in the form of token redundancy and block redundancy. For token redundancy, we analytically find that early token pruning methods either cause inconsistency between training and inference or introduce extra computation at inference. We therefore customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically, based on the empirical observation that the inference speed of Mamba-based vision models is largely affected by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. On Vim-S, it reduces FLOPs by 35.2% with an accuracy loss of only 1.7%. It also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
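The rearrangement idea can be illustrated with a toy sketch. It assumes a single unidirectional recurrent scan and a fixed keep mask, both simplifications chosen for illustration; the function names `rearrange_pruned` and `toy_scan` are hypothetical and are not taken from the DyVM code. Moving retained tokens to the front (in their original order) means that a scan over the full rearranged sequence produces, on the retained prefix, the same states as a scan over a sequence from which the pruned tokens were physically dropped.

```python
from typing import List, Sequence, Tuple


def rearrange_pruned(tokens: Sequence[float], keep: Sequence[bool]) -> Tuple[List[float], int]:
    """Stable partition: retained tokens first, pruned tokens after, order preserved."""
    if len(tokens) != len(keep):
        raise ValueError("tokens and keep must have the same length")
    kept = [t for t, k in zip(tokens, keep) if k]
    pruned = [t for t, k in zip(tokens, keep) if not k]
    return kept + pruned, len(kept)


def toy_scan(xs: Sequence[float], decay: float = 0.5) -> List[float]:
    """Toy causal recurrence h_t = decay * h_{t-1} + x_t (a stand-in for an SSM scan)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
keep = [True, False, True, True, False]

# "Training-style" pass: the sequence keeps its full length, but is rearranged.
rearranged, n_kept = rearrange_pruned(tokens, keep)
train_states = toy_scan(rearranged)[:n_kept]

# "Inference-style" pass: pruned tokens are actually removed.
kept_only = [t for t, k in zip(tokens, keep) if k]
infer_states = toy_scan(kept_only)

print(rearranged)                    # retained tokens first, then pruned ones
print(train_states == infer_states)  # states on the retained prefix agree
```

Without the rearrangement, a pruned token left in the middle of the sequence would still feed the recurrent state of every later token during training, which is one way the training-inference inconsistency mentioned above can arise.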
