LBMamba: Locally Bi-directional Mamba

Jingwei Zhang, Xi Han, Hong Qin, Mahdi S. Hosseini, Dimitris Samaras

TL;DR

LBMamba introduces a locally bi-directional state-space block that embeds a backward pass inside the forward scan to avoid costly global backward sweeps, and LBVim, a vision backbone that alternates scan directions every two layers to restore global receptive fields. The approach achieves superior throughput-accuracy trade-offs on natural images and giga-pixel WSIs, improving ImageNet top-1, ADE20K mIoU, COCO APm/b, and WSI AUC/F1/accuracy under comparable latency. Extensive ablations confirm the benefit of the local backward component and the sequence-reversal strategy, while hardware-aware CUDA design ensures minimal runtime overhead. The method generalizes across multiple SOTA Mamba-based models and MIL pipelines (e.g., MambaMIL), highlighting practical impact for efficient long-range vision modeling without sacrificing performance.

Abstract

Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel scan, has recently emerged as a linearly-scaling alternative to self-attention. Because of its unidirectional nature, each state in Mamba only has information about its previous states and is blind to the states after it. Current Mamba-based computer-vision methods typically overcome this by augmenting Mamba's global forward scan with a global backward scan, forming a bi-directional scan to restore a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally had. To eliminate these extra scans, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward scan and executes it in per-thread registers. Building on LBMamba, we present LBVim, a backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate our approach on both natural images and whole slide images (WSIs) and show that it consistently offers a superior performance-throughput trade-off. Under the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher APb and 1.1% higher APm on the COCO detection dataset. Our method also boosts the accuracy of four SOTA Mamba models, namely VMamba, LocalVim, PlainMamba and Adventurer, by 0.5% to 3.4%. We integrate LBMamba into the SOTA pathology multiple instance learning (MIL) model, MambaMIL, which is unidirectional. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy. Our code is available at https://github.com/cvlab-stonybrook/LBMamba.
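
To make the idea of a locally backward scan concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a scalar linear recurrence h[t] = a[t] * h[t-1] + b[t] as a stand-in for the SSM update, reuses the same coefficients for the backward direction, and simply adds the two results; the function names and the `window` parameter are hypothetical, and how the real block combines the two scans is not specified here.

```python
import numpy as np


def forward_scan(a, b):
    """Global forward recurrence h[t] = a[t] * h[t-1] + b[t], starting from 0."""
    h = np.zeros(len(b), dtype=float)
    prev = 0.0
    for t in range(len(b)):
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h


def local_backward_scan(a, b, window):
    """Backward recurrence that restarts from 0 at every window boundary.

    Assumption: the backward direction reuses the same coefficients a, b.
    """
    h = np.zeros(len(b), dtype=float)
    for start in range(0, len(b), window):
        stop = min(start + window, len(b))
        nxt = 0.0
        for t in range(stop - 1, start - 1, -1):
            nxt = a[t] * nxt + b[t]
            h[t] = nxt
    return h


def locally_bidirectional_scan(a, b, window):
    """Sum of the global forward scan and the window-local backward scan."""
    return forward_scan(a, b) + local_backward_scan(a, b, window)


if __name__ == "__main__":
    # Unit impulse at position 5, no decay, windows [0..3] and [4..7].
    a = np.ones(8)
    b = np.zeros(8)
    b[5] = 1.0
    print(locally_bidirectional_scan(a, b, window=4))
```

With an impulse at position 5, position 4 (same window, earlier in the sequence) receives a non-zero value, while positions 0 to 3 (previous window) stay at zero. This illustrates why a second mechanism, such as alternating the global scan direction across layers, is still needed for a global receptive field.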

Paper Structure

This paper contains 23 sections, 6 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: (Left): The unidirectional scanning mechanism of the vanilla Mamba [gu2023mamba], where each state only has information about its previous (left) states. (Center): The standard bi-directional scanning mechanism [vim], where a dedicated backward scan is conducted and added to the forward scan. This operation involves an additional read/write of the data and thus doubles the running time. (Right): Our LBMamba scan conducts a locally backward scan that is integrated into the forward scan process. This requires only a single read/write pass over the data and is therefore very fast.
  • Figure 1: Top-1 accuracy (%) and throughput (images/second, denoted as T.P.) of LBVim variants on ImageNet-1K with $224\times224$ inputs. LBVim-Ti matches Vim-Ti (with global average pooling) while delivering an 82% higher throughput. LBVim-S is only 0.7 percentage points below Vim-S yet runs 69% faster. LBVim-300 and LBVim-528 attain substantially higher accuracy than Vim-Ti and Vim-S, respectively, at comparable throughput.
  • Figure 2: (Left): The overall architecture of LBVim: The input image is split into patches, which are embedded as patch tokens. These tokens, combined with positional embeddings, are fed to $U$ LBMamba encoders. Finally, a global average pooling (GAP) layer or a multi-head attention (MHA) layer, followed by an MLP head, predicts the image class. (Right): The architecture of the LBVim encoder. We reverse the sequence at the end of each encoder such that the global scan direction in LBMamba is switched every two consecutive encoders, ensuring that each token achieves a global receptive field after every two encoders.
  • Figure 3: Our hardware-aware LBMamba CUDA operator with thread-level locally bi-directional scan. Blue represents operations on registers (Reg.), orange represents operations on SRAM, and green represents operations on HBM. The blue box shows the scanning operations of the vanilla Mamba: two threads $T_1$ and $T_2$ each first load 3 elements from HBM into registers. The global forward scan is then conducted as follows: 1) Each thread performs an in-register prefix scan over its 3 elements. 2) Threads exchange their partial results through SRAM to obtain the prefix of each thread. 3) Each thread combines its prefix with its private elements, completing the global scan. Finally, the scanned results are written back to HBM. The red box highlights the extra scanning operations of LBMamba: each thread performs an in-register backward scan over its 3 elements (the same as step 1 except for the direction) and adds the result to the forward scan results. All the extra operations are in registers and are therefore very fast. $h_{i\rightarrow j}$ is the partial result, i.e., the hidden state obtained by scanning from time step $i$ to $j$.
  • Figure 4: The accuracy-throughput trade-off curve of Vim and LBVim. The curve of LBVim consistently lies in the upper-right quadrant relative to Vim, highlighting a more favorable trade-off. We include the base version of Vim to better illustrate the trends on larger models.
  • ...and 3 more figures
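
The three-step scan described in the Figure 3 caption can be emulated on the CPU. The sketch below is an illustration under stated assumptions, not the CUDA operator itself: each row of a reshaped array plays the role of one thread's registers, a plain loop stands in for the SRAM exchange, and the recurrence is the simplified scalar form h[t] = a[t] * h[t-1] + b[t]. The function name `chunked_lb_scan` and the parameter `per_thread` are hypothetical.

```python
import numpy as np


def chunked_lb_scan(a, b, per_thread=3):
    """Emulate a thread-chunked forward scan plus an in-chunk backward scan.

    Returns (forward, forward + local_backward) as flat arrays.
    """
    assert len(b) % per_thread == 0, "sketch assumes equally sized chunks"
    a = np.asarray(a, dtype=float).reshape(-1, per_thread)
    b = np.asarray(b, dtype=float).reshape(-1, per_thread)
    n_threads = a.shape[0]

    # Step 1: each "thread" scans its own elements starting from a zero state.
    # decay[k, i] is the product of a over the chunk up to element i;
    # local[k, i] is the chunk-local hidden state at element i.
    decay = np.empty_like(a)
    local = np.empty_like(b)
    for k in range(n_threads):
        acc_a, acc_h = 1.0, 0.0
        for i in range(per_thread):
            acc_a = acc_a * a[k, i]
            acc_h = a[k, i] * acc_h + b[k, i]
            decay[k, i], local[k, i] = acc_a, acc_h

    # Step 2: exchange chunk totals to get the state entering each chunk
    # (done through SRAM on the GPU; a sequential loop here).
    carry = np.zeros(n_threads)
    for k in range(1, n_threads):
        carry[k] = decay[k - 1, -1] * carry[k - 1] + local[k - 1, -1]

    # Step 3: combine the incoming state with the chunk-local results.
    forward = decay * carry[:, None] + local

    # Extra step: in-chunk backward scan, added to the forward results.
    # Assumption: same coefficients as the forward direction.
    backward = np.empty_like(b)
    for k in range(n_threads):
        acc = 0.0
        for i in reversed(range(per_thread)):
            acc = a[k, i] * acc + b[k, i]
            backward[k, i] = acc

    return forward.reshape(-1), (forward + backward).reshape(-1)
```

The forward output of this decomposition matches a plain sequential scan, which is what makes the chunked layout a valid parallelization. The backward part never crosses a chunk boundary, so it needs no exchange step.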