Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training
Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You
TL;DR
This paper investigates token reduction in Vision Mamba and identifies that pruning informative tokens can cause disproportionate loss of both general and specific knowledge due to Mamba's sequential dependencies. It introduces R-MeeTo, a two-stage framework that first merges tokens to preserve information and then retrains briefly to rebuild key knowledge, achieving near-state-of-the-art accuracy at significantly reduced compute. The authors formalize the information-flow differences between Attention-based transformers and SSM-based Mamba, presenting an enrichment theorem that explains why Mamba tokens near sequence boundaries carry more general knowledge and why merging is preferable to pruning. Empirical results on ImageNet-1K across Vim-Ti/S/B (and VideoMamba) show that R-MeeTo recovers performance to within an accuracy drop of at most 0.9% and can re-train in minutes, delivering notable inference speedups (up to 1.5x) while maintaining high accuracy. Overall, R-MeeTo offers a practical, scalable path to accelerating Vision Mamba deployments with minimal retraining cost and broad hardware compatibility.
Abstract
Vision Mamba has shown close to state-of-the-art performance on computer vision tasks, drawing much interest in increasing its efficiency. A promising approach is token reduction, which has been successfully applied in ViTs. However, pruning informative tokens in Mamba leads to a high loss of key knowledge and degraded performance. The alternative of merging tokens preserves more information than pruning, but it also suffers at large compression ratios. Our key insight is that a quick round of retraining after token merging yields robust results across various compression ratios. Empirically, pruned Vims drop by at most 0.9% accuracy on ImageNet-1K once recovered by our proposed framework, R-MeeTo, in our main evaluation. We show how simply and effectively this fast recovery can be achieved at the minute level: in particular, a 35.9% accuracy spike over 3 epochs of training on Vim-Ti. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2x (up to 1.5x) inference speedup.
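
To make the merging stage concrete, the following is a minimal sketch of similarity-based token merging that keeps the surviving tokens in their original scan order. It is an illustration under stated assumptions, not the R-MeeTo implementation: the function names `cosine` and `merge_tokens`, the even/odd split into candidate and destination sets, the use of cosine similarity on raw features, and the averaging rule are all hypothetical choices. The brief re-training stage that follows merging is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def merge_tokens(tokens, r):
    """Remove up to r tokens by averaging each into its most similar partner.

    tokens: list of feature vectors (lists of floats) in scan order.
    r: number of tokens to merge away (capped at the number of candidates).
    Returns a new, shorter list; surviving tokens keep their relative order.
    """
    n = len(tokens)
    a_idx = list(range(0, n, 2))  # assumed split: even positions may be merged away
    b_idx = list(range(1, n, 2))  # assumed split: odd positions receive merges
    if r <= 0 or not b_idx:
        return [list(t) for t in tokens]
    r = min(r, len(a_idx))

    # For each candidate token, find its most similar destination token.
    matches = []
    for a in a_idx:
        best = max(b_idx, key=lambda b: cosine(tokens[a], tokens[b]))
        matches.append((cosine(tokens[a], tokens[best]), a, best))
    # Merge the r most similar pairs first.
    matches.sort(key=lambda m: m[0], reverse=True)

    sums = [[float(x) for x in t] for t in tokens]
    counts = [1] * n
    removed = set()
    for _, a, b in matches[:r]:
        sums[b] = [s + x for s, x in zip(sums[b], tokens[a])]
        counts[b] += 1
        removed.add(a)

    # Average each destination over everything merged into it; drop merged sources.
    return [[s / counts[i] for s in sums[i]] for i in range(n) if i not in removed]


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(merge_tokens(tokens, 2))  # [[1.0, 0.0], [0.0, 1.0]]
```

Unlike pruning, which would discard the two removed tokens outright, each removed token here still contributes to the average stored in its partner, which is the sense in which merging preserves more information at the same sequence length.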
