Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

Saarthak Kapse; Robin Betz; Srinivasan Sivanandan

TL;DR

FastVim advances Vision Mamba by introducing spatial pooling to reduce the recurrent computation burden in the SSM blocks, enabling near-linear scaling with image resolution. The approach maintains competitive accuracy across image classification, segmentation, detection, and microscopy tasks while delivering substantial throughput gains (up to $72.5\%$ speedup on high-resolution inputs) and reduced FLOPs. Extensions FastMaskVim and FastChannelVim broaden applicability to irregular token grids and per-channel tokenization, achieving state-of-the-art MAE-pretrained performance on ImageNet-1k and strong gains in microscopy imaging. The work demonstrates that carefully designed pooling within a Mamba-based backbone can outperform transformer baselines in long-token settings, offering practical impact for high-resolution vision, video, and gigapixel imaging domains.

Abstract

State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Unlike Vision Transformers, Mamba achieves linear complexity for token interactions through a recurrent hidden-state process. This sequential processing is accelerated by a parallel scan algorithm, which reduces the computational time of the recurrence from $L$ sequential steps to $\log(L)$ parallel steps, where $L$ is the number of input tokens. In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2$\times$ reduction in the number of parallel steps in the SSM block. Our model offers up to a $72.5\%$ inference speedup over baseline Vision Mamba models on high-resolution (2048$\times$2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput across a range of tasks, including image classification, cell perturbation prediction, segmentation, and object detection. Code is available at https://github.com/insitro/FastVim
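
The 2$\times$ figure in the abstract can be checked with a little arithmetic. The sketch below is a minimal illustration rather than the authors' implementation, and it makes three assumptions that the abstract does not state: a patch size of 16, mean pooling as the pooling operator, and a scan depth of $\lceil \log_2 L \rceil$. The helper names `scan_depth`, `pool_tokens`, and `pooled_axis_for_block` are hypothetical. Pooling an $H \times W$ token grid along one spatial dimension leaves $\sqrt{L}$ tokens when $H = W$, and $\log_2 \sqrt{L} = \tfrac{1}{2}\log_2 L$, which halves the number of parallel steps.

```python
import numpy as np


def scan_depth(num_tokens: int) -> int:
    """Parallel steps of a scan over num_tokens, modeled as ceil(log2(L))."""
    # Integer arithmetic avoids floating-point rounding in log2.
    return (num_tokens - 1).bit_length()


def pool_tokens(grid: np.ndarray, axis: int) -> np.ndarray:
    """Mean-pool an (H, W, D) token grid along one spatial axis.

    Mean pooling is an assumption; the abstract only says tokens are pooled.
    """
    return grid.mean(axis=axis)


def pooled_axis_for_block(block_index: int) -> int:
    """Alternate the pooled spatial axis from one block to the next."""
    return block_index % 2


# A 2048x2048 image with an assumed patch size of 16 gives a 128x128 grid.
H = W = 2048 // 16
D = 8  # small, arbitrary embedding width for illustration
grid = np.random.default_rng(0).normal(size=(H, W, D))

full_depth = scan_depth(H * W)  # scan over all 16384 tokens
for block in range(2):
    pooled = pool_tokens(grid, pooled_axis_for_block(block))
    pooled_depth = scan_depth(pooled.shape[0])  # scan over 128 pooled tokens
    print(block, pooled.shape, full_depth, pooled_depth)
```

Under these assumptions the scan depth drops from 14 to 7 steps in each block. The sketch covers only the step count; how the pooled scan output is mapped back to the full token grid is not described in the abstract and is left out here.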
