MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
Shanhui Liu, Rui Xu, Yunke Wang
TL;DR
Vision Mamba's efficiency is limited by token length; MambaScope introduces a coarse-to-fine adaptive inference framework that processes images in large patches and selectively refines informative regions at higher resolution when confidence is low. It leverages token importance scores derived from bidirectional state-space models and reuses coarse-stage features to minimize overhead, with a training objective that encourages coarse outputs to align with fine outputs. Across ImageNet-1K, ADE20K, and COCO-2017, MambaScope achieves substantial FLOPs reductions while maintaining or surpassing accuracy versus ViM and other token-reduction baselines. The results demonstrate flexible, per-image computation budgeting and robust performance gains across classification and dense prediction tasks.
Abstract
Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss as they discard or compress token representations. This problem is further exacerbated when the same fine-grained token processing is uniformly applied across all images regardless of visual complexity. We observe that not all inputs require fine-grained processing: simple images can be effectively handled at a coarse resolution, while only complex ones require refinement. Based on this insight, we propose MambaScope, an adaptive framework for efficient inference for Vision Mamba. MambaScope first performs coarse-grained inference by dividing the input image into large patches, significantly reducing token length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover essential visual details with minimal additional cost. This dynamic resolution assignment strategy allows MambaScope to allocate computation adaptively according to image complexity, achieving efficient processing without compromising accuracy. Experiments across various vision tasks demonstrate that MambaScope outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
