Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training

Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You

TL;DR

This paper investigates token reduction in Vision Mamba and finds that pruning informative tokens can cause a disproportionate loss of both general and specific knowledge because of Mamba's sequential dependencies. It introduces R-MeeTo, a two-stage framework that first merges tokens to preserve information and then re-trains briefly to rebuild key knowledge, reaching near state-of-the-art accuracy at significantly reduced compute. The authors formalize the information-flow differences between Attention-based Transformers and SSM-based Mamba, presenting an enrichment theorem that explains why Mamba tokens near sequence boundaries carry more general knowledge and why merging is preferable to pruning. Empirical results on ImageNet-1K across Vim-Ti/S/B (and VideoMamba) show that R-MeeTo limits the accuracy drop to at most 0.9% in the main evaluation and re-trains in minutes, delivering inference speedups of up to 1.5x while maintaining high accuracy. Overall, R-MeeTo offers a practical, scalable path to accelerating Vision Mamba deployments with minimal re-training cost across different GPUs.

Abstract

Vision Mamba has shown close to state-of-the-art performance on computer vision tasks, drawing much interest in increasing its efficiency. A promising approach is token reduction, which has been successfully applied to ViTs. However, pruning informative tokens in Mamba leads to a high loss of key knowledge and degraded performance. The alternative of merging tokens preserves more information than pruning, but it also suffers at large compression ratios. Our key insight is that a quick round of re-training after token merging yields robust results across various compression ratios. Empirically, in our main evaluation, pruned Vims drop at most 0.9% accuracy on ImageNet-1K once recovered by our proposed framework, R-MeeTo. We show how simply and effectively this fast recovery can be achieved at the minute level; in particular, Vim-Ti gains 35.9% accuracy over 3 epochs of training. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2x (up to 1.5x) inference speedup.
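
To make the merge step concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a hypothetical merging rule: the r most similar disjoint neighbour pairs, ranked by cosine similarity, are each replaced by their mean, and all other tokens keep their original order. The names `cosine` and `merge_tokens` and the parameter `r` are illustrative only, and the re-training stage is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_tokens(tokens, r):
    """Merge the r most similar disjoint neighbour pairs (2i, 2i+1) by averaging.

    Assumption: the pairing rule and the plain mean are illustrative choices,
    not the scheme used in the paper.
    """
    n_pairs = len(tokens) // 2
    r = max(0, min(r, n_pairs))
    # Score every disjoint neighbour pair by cosine similarity.
    scores = [(cosine(tokens[2 * i], tokens[2 * i + 1]), i) for i in range(n_pairs)]
    # Pick the r highest-scoring pairs; ties go to the earlier pair.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    chosen = {i for _, i in ranked[:r]}
    merged = []
    for i in range(n_pairs):
        a, b = tokens[2 * i], tokens[2 * i + 1]
        if i in chosen:
            # Replace the pair with its element-wise mean.
            merged.append([(x + y) / 2 for x, y in zip(a, b)])
        else:
            # Keep both tokens, in their original order.
            merged.extend([list(a), list(b)])
    if len(tokens) % 2:
        # An unpaired trailing token is kept unchanged.
        merged.append(list(tokens[-1]))
    return merged


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(merge_tokens(tokens, r=2))
# prints [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 3.0]]
```

In this toy input the first and last pairs point in the same direction, so they are averaged, and the sequence shrinks from six tokens to four without reordering the survivors.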

Paper Structure

This paper contains 76 sections, 7 theorems, 22 equations, 6 figures, 22 tables, 4 algorithms.

Key Result

Theorem 1

(Enrichment effect in Mamba.) Under Assumption \ref{assum:mtd_equal_comp_info} and Assumption \ref{assum:mtd_equal_knowledge}, we have the following relationship between the Attention block and the SSM.

Figures (6)

  • Figure 1: Performance comparison w.r.t. reduction ratio: a) Transformer and Mamba under token pruning; b) merging and pruning with Mamba. The Transformer and Mamba models are DeiT-S [deit] and Vim-S [vision_mamba], respectively, tested on ImageNet-1K [imagenet].
  • Figure 2: Sketch of the analysis: Mamba is sensitive to token reduction.
  • Figure 3: Supporting facts. 1) The empirical results of $I(X;Y)$, the mutual information between inputs $X$ and outputs $Y$. Mamba is sensitive to token order. 2) Only Mamba's performance drops if the tokens are additionally shuffled before re-training. The Attention block and the SSM are measured by MINE [mine] on the middle layers of DeiT-S and Vim-S (the $7\text{-th}$ of 12 layers and the $14\text{-th}$ of 24 layers, respectively). Experiments on i) token reduction are conducted with DeiT-S [deit] (Transformer) and Vim-S [vision_mamba] (Mamba) on ImageNet-1K [imagenet]. The reduction ratios in the experiment on ii) shuffled tokens are 0.14 for Vim-Ti and 0.31 for Vim-S/Vim-B (see Sec. \ref{3_exp_ablation} for more details about the ablation). The shuffle strategy is odd-even shuffle: [0,1,2,3] → [0,2], [1,3] → [0,2,1,3].
  • Figure 4: Throughput and top-1 accuracy comparison of Vim-S using R-MeeTo across different reduction ratios and GPUs. R-MeeTo improves inference speed while preserving strong model accuracy across various hardware platforms. Notably, the performance drop at the reduction ratio of 0.14 results from I/O and additional computational overhead outweighing the benefits of token reduction.
  • Figure 5: Visualization of R-MeeTo on ImageNet-1K [imagenet]. Tokens belonging to one object are merged into one.
  • ...and 1 more figures
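
The odd-even shuffle named in the Figure 3 caption can be sketched in Python as follows. The function name `odd_even_shuffle` is hypothetical. The sketch only reproduces the index pattern given in the caption and assumes positions are counted from 0.

```python
def odd_even_shuffle(tokens):
    """Return the even-position tokens followed by the odd-position tokens."""
    evens = list(tokens[0::2])  # positions 0, 2, 4, ...
    odds = list(tokens[1::2])   # positions 1, 3, 5, ...
    return evens + odds


print(odd_even_shuffle([0, 1, 2, 3]))
# prints [0, 2, 1, 3]
```

The shuffle keeps every token and changes only the order, which is what lets the caption's experiment separate sensitivity to order from sensitivity to lost information.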

Theorems & Definitions (13)

  • Remark 1
  • Theorem 1
  • Corollary 1
  • Remark 2
  • Proposition 1
  • Corollary 2
  • Corollary 3
  • Proposition 2
  • Definition 1
  • Definition 2
  • ...and 3 more