Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing

Saarthak Kapse, Robin Betz, Srinivasan Sivanandan

TL;DR

FastVim advances Vision Mamba by introducing spatial pooling to reduce the recurrent computation burden in the SSM blocks, enabling near-linear scaling with image resolution. The approach maintains competitive accuracy across image classification, segmentation, detection, and microscopy tasks while delivering substantial throughput gains (up to $72.5\%$ speedup on high-resolution inputs) and reduced FLOPs. Extensions FastMaskVim and FastChannelVim broaden applicability to irregular token grids and per-channel tokenization, achieving state-of-the-art MAE-pretrained performance on ImageNet-1k and strong gains in microscopy imaging. The work demonstrates that carefully designed pooling within a Mamba-based backbone can outperform transformer baselines in long-token settings, offering practical impact for high-resolution vision, video, and gigapixel imaging domains.

Abstract

State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from $L$ sequential steps to $\log(L)$ parallel steps with respect to the number of input tokens ($L$). In this work, we propose Fast Vision Mamba (FastVim), which further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a $2\times$ reduction in the number of parallel steps in the SSM block. Our model offers up to a $72.5\%$ inference speedup compared to baseline Vision Mamba models on high-resolution ($2048\times 2048$) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection. Code is available at https://github.com/insitro/FastVim.
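The step counts in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a square image, a patch size of 16 (not stated above), and a base-2 logarithm for the depth of a parallel scan. The helper names `parallel_scan_depth` and `compare_depth` are hypothetical and do not come from the FastVim code base.

```python
import math


def parallel_scan_depth(num_tokens: int) -> int:
    """Number of parallel steps for a scan over num_tokens items (assumed log2 depth)."""
    return math.ceil(math.log2(num_tokens))


def compare_depth(resolution: int, patch_size: int = 16):
    """Return (h, depth without pooling, depth with pooling) for a square image.

    patch_size=16 is an assumption made for illustration.
    """
    h = resolution // patch_size             # tokens along one image side
    depth_full = parallel_scan_depth(h * h)  # scan over all L = h^2 tokens
    depth_pooled = parallel_scan_depth(h)    # scan over h pooled tokens
    return h, depth_full, depth_pooled


for res in (224, 2048):
    h, full, pooled = compare_depth(res)
    print(f"{res}x{res}: h={h}, L={h * h}, steps {full} -> {pooled}")
```

Under these assumptions, a $2048\times 2048$ image gives $h = 128$ and $L = 16384$, so the scan depth drops from 14 to 7, which matches the $2\times$ reduction since $\log(h^2) = 2\log(h)$.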

Paper Structure

This paper contains 25 sections, 4 equations, 13 figures, 13 tables, 1 algorithm.

Figures (13)

  • Figure 1: FastVim accelerates Vim by mean pooling tokens across columns or rows, transforming token scaling from quadratic to linear with resolution. FastVim requires $\log(h)$ parallel steps, compared to Vim's $\log(h^2)$ parallel steps, in the SSM (Mamba's contextualization module), where $h$ is the number of tokens along the height or width of the image and the model receives $L = h^2$ input tokens.
  • Figure 2: Overview of FastVim: Input image tokens are fed to norm and expansion layers, then the output $x$ is transposed ($T$) every block for alternate pooling of rows and columns. Tokens are pooled post-Conv1D, processed by the SSM, and decompressed before the skip-connection ($\mathbf{D}$ in eq. \ref{eq:dtm}). Note that the flattened tokens are reshaped into a 2D grid prior to the transpose and pooling layers, and are flattened again after these operations. Panel (c) compares the Forward SSM + Backward SSM inference time in one layer of Vision Mamba with the Forward SSM + Backward SSM + pooling + repeat inference time in one layer of Fast Vision Mamba. As resolution increases, FastVim needs significantly less time than Vim for the contextualization module (further detailed in Supplement \ref{additional_throughput}).
  • Figure 3: Stability Issue in Vim-B on ImageNet-1k.
  • Figure 4: Comparison of FLOPs (G) for FastVim, Vim, and ViT across different resolutions.
  • Figure 5: Comparison of inference throughput (it/s) for FastVim, Vim, and ViT across different resolutions. Tested on H100 GPUs with a batch size of 128.
  • ...and 8 more figures
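
The pool, scan, and repeat flow described in the Figure 2 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: tokens are single-channel floats on an $h \times w$ grid, the SSM is replaced by a fixed-decay linear recurrence (`toy_scan`, a hypothetical stand-in for Mamba's selective scan), and the norm, expansion, Conv1D, and skip-connection steps are omitted. The choice of which axis is pooled on even versus odd blocks is also an assumption.

```python
from typing import List

Grid = List[List[float]]  # h x w grid of single-channel token values (toy stand-in)


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns so the other image axis gets pooled."""
    return [list(col) for col in zip(*grid)]


def mean_pool_columns(grid: Grid) -> List[float]:
    """Average each row over its columns: h x w tokens -> h pooled tokens."""
    return [sum(row) / len(row) for row in grid]


def toy_scan(seq: List[float], decay: float = 0.5) -> List[float]:
    """Placeholder recurrence s_t = decay * s_{t-1} + x_t (NOT Mamba's selective scan)."""
    out, state = [], 0.0
    for x in seq:
        state = decay * state + x
        out.append(state)
    return out


def repeat_columns(pooled: List[float], width: int) -> Grid:
    """Decompress: copy each pooled value back across its row."""
    return [[value] * width for value in pooled]


def fast_block(grid: Grid, block_index: int) -> Grid:
    """One toy block: pool, scan over the pooled tokens only, then repeat back."""
    flip = block_index % 2 == 1          # alternate the pooled axis (assumed order)
    x = transpose(grid) if flip else grid
    pooled = mean_pool_columns(x)        # recurrence now runs over fewer tokens
    scanned = toy_scan(pooled)
    y = repeat_columns(scanned, len(x[0]))
    return transpose(y) if flip else y


grid = [[1.0, 3.0], [2.0, 4.0]]
even = fast_block(grid, 0)  # pools across columns
odd = fast_block(grid, 1)   # transposes first, so pools across rows
print(even, odd)
```

The point of the sketch is that the recurrence in `toy_scan` only ever sees one token per row (or per column), so its length grows with $h$ rather than $h^2$; the repeat step restores the original grid shape for the layers that follow.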