
MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba

Shanhui Liu, Rui Xu, Yunke Wang

TL;DR

Vision Mamba's efficiency is limited by token length; MambaScope introduces a coarse-to-fine adaptive inference framework that processes images in large patches and selectively refines informative regions at higher resolution when confidence is low. It leverages token importance scores derived from bidirectional state-space models and reuses coarse-stage features to minimize overhead, with a training objective that encourages coarse outputs to align with fine outputs. Across ImageNet-1K, ADE20K, and COCO-2017, MambaScope achieves substantial FLOPs reductions while maintaining or surpassing accuracy versus ViM and other token-reduction baselines. The results demonstrate flexible, per-image computation budgeting and robust performance gains across classification and dense prediction tasks.

Abstract

Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss as they discard or compress token representations. This problem is further exacerbated when the same fine-grained token processing is uniformly applied across all images regardless of visual complexity. We observe that not all inputs require fine-grained processing: simple images can be effectively handled at a coarse resolution, while only complex ones require refinement. Based on this insight, we propose MambaScope, an adaptive framework for efficient inference for Vision Mamba. MambaScope first performs coarse-grained inference by dividing the input image into large patches, significantly reducing token length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover essential visual details with minimal additional cost. This dynamic resolution assignment strategy allows MambaScope to allocate computation adaptively according to image complexity, achieving efficient processing without compromising accuracy. Experiments across various vision tasks demonstrate that MambaScope outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
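
The confidence-gated control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `coarse_model` and `fine_model` are hypothetical callables that return class logits, confidence is assumed to be the maximum softmax probability, and the threshold of 0.7 is an arbitrary placeholder. The sketch also omits region selection and coarse-feature reuse, which the paper applies in the fine stage.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_to_fine_predict(
    image,
    coarse_model: Callable[[object], Sequence[float]],  # hypothetical: large-patch pass
    fine_model: Callable[[object], Sequence[float]],    # hypothetical: refined pass
    threshold: float = 0.7,                              # assumed value, not from the paper
) -> Tuple[int, str]:
    """Return (predicted class index, stage that produced the prediction)."""
    # Stage 1: cheap inference on large patches (short token sequence).
    probs = softmax(coarse_model(image))
    confidence = max(probs)  # assumption: confidence = max softmax probability
    if confidence >= threshold:
        # Confident enough: exit early and skip the fine stage entirely.
        return probs.index(confidence), "coarse"

    # Stage 2: low confidence, so re-process at finer resolution.
    probs = softmax(fine_model(image))
    return probs.index(max(probs)), "fine"
```

Under these assumptions, raising the threshold sends more images to the fine stage (higher accuracy, more FLOPs), while lowering it lets more images exit after the coarse stage, which is one way the per-image computation budget could be tuned.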

Paper Structure

This paper contains 18 sections, 21 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison of coarse scope (7$\times$7), fine scope (14$\times$14) and coarse-to-fine scope in Vision Mamba. For simplicity, we use fewer patches here for better visualization. While fine-grained processing yields higher accuracy, it incurs significantly higher computational cost. In contrast, the coarse-to-fine approach selectively applies fine-grained analysis, achieving a better accuracy-efficiency trade-off by focusing computation on informative regions.
  • Figure 2: Inference pipeline of MambaScope. The input image is first processed in the coarse stage, using large patches to produce an initial output. If the model's confidence is insufficient, it proceeds to the fine stage, where it identifies informative regions, splits them into smaller patches, and reuses coarse-stage features to refine the output. Both stages share the same network parameters.
  • Figure 3: Visualization of the token-level attention map based on the proposed SSM-derived importance scores. Brighter regions indicate higher token importance, revealing the model's ability to focus on semantically salient areas.
  • Figure 4: Comparison of Top-1 accuracy versus FLOPs against other ViM optimization methods.
  • Figure 5: Ablation study on feature reuse and token arrangement strategies conducted on the miniImageNet dataset.
  • ...and 4 more figures
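
The token-length savings implied by the 7$\times$7 and 14$\times$14 grids in Figure 1 can be worked out with a small calculation. The sketch below is illustrative only: the function `token_counts` is a hypothetical helper, and it assumes that each refined coarse patch is replaced by its constituent fine patches while unrefined coarse tokens stay in the sequence. The paper's actual token arrangement may differ.

```python
from typing import Tuple


def token_counts(coarse_grid: int = 7, fine_grid: int = 14, refined: int = 0) -> Tuple[int, int, int]:
    """Return (coarse tokens, full-fine tokens, mixed coarse-to-fine tokens).

    `refined` is the number of coarse patches selected for refinement.
    Assumption: a refined coarse patch is swapped for scale*scale fine patches.
    """
    if fine_grid % coarse_grid != 0:
        raise ValueError("fine_grid must be a multiple of coarse_grid")
    coarse = coarse_grid ** 2
    if not 0 <= refined <= coarse:
        raise ValueError("refined must be between 0 and the number of coarse patches")

    scale = fine_grid // coarse_grid          # fine patches per coarse patch side
    full_fine = fine_grid ** 2                # every region processed at fine scope
    mixed = (coarse - refined) + refined * scale ** 2
    return coarse, full_fine, mixed


# Refining 10 of the 49 coarse patches: 39 coarse tokens + 40 fine tokens.
print(token_counts(refined=10))  # (49, 196, 79)
```

With these grids, the coarse pass uses a quarter of the tokens of the fine pass, and the mixed sequence grows by three tokens for every coarse patch that is refined.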