LBMamba: Locally Bi-directional Mamba

Jingwei Zhang; Xi Han; Hong Qin; Mahdi S. Hosseini; Dimitris Samaras

TL;DR

LBMamba introduces a locally bi-directional state-space block that embeds a backward pass inside the forward scan to avoid costly global backward sweeps, and LBVim, a vision backbone that alternates scan directions every two layers to restore a global receptive field. The approach achieves a better throughput-accuracy trade-off on natural images and gigapixel whole slide images (WSIs), improving ImageNet top-1 accuracy, ADE20K mIoU, COCO APb and APm, and WSI AUC, F1, and accuracy under comparable latency. Extensive ablations confirm the benefit of the local backward component and the sequence-reversal strategy, while a hardware-aware CUDA design keeps runtime overhead minimal. The method generalizes across multiple SOTA Mamba-based models and MIL pipelines (e.g., MambaMIL), which makes it practical for efficient long-range vision modeling without sacrificing performance.

Abstract

Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel scan, has recently emerged as a linearly scaling alternative to self-attention. Because of its unidirectional nature, each state in Mamba only has information about its previous states and is blind to the states that follow. Current Mamba-based computer-vision methods typically overcome this limitation by augmenting Mamba's global forward scan with a global backward scan, forming a bi-directional scan that restores a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally had. To eliminate these extra scans, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward scan and executes it in per-thread registers. Building on LBMamba, we present LBVim, a backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate our approach on both natural images and whole slide images (WSIs) and show that it consistently offers a superior performance-throughput trade-off. Under the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher APb and 1.1% higher APm on the COCO detection dataset. Our method also boosts the accuracy of four SOTA Mamba models, namely VMamba, LocalVim, PlainMamba and Adventurer, by 0.5% to 3.4%. We also integrate LBMamba into MambaMIL, the SOTA pathology multiple instance learning (MIL) model, which is unidirectional. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy. Our code is available at https://github.com/cvlab-stonybrook/LBMamba.
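To make the idea of a backward scan embedded in the forward scan concrete, here is a minimal pure-Python sketch. It is not the authors' CUDA kernel, and the abstract does not give the exact equations, so several parts are assumptions made for illustration: the scalar recurrence `h[t] = a[t] * h[t-1] + bx[t]`, the fixed `window` that stands in for the tokens a single thread would hold in registers, the rule of summing the two states and subtracting the shared input term, and all function names.

```python
def forward_scan(a, bx):
    """Global forward recurrence: h[t] = a[t] * h[t-1] + bx[t], starting from 0."""
    h, prev = [], 0.0
    for a_t, bx_t in zip(a, bx):
        prev = a_t * prev + bx_t
        h.append(prev)
    return h


def local_backward_scan(a, bx, window):
    """Backward recurrence that restarts from 0 in every block of `window` tokens.

    Assumed form: h[t] = a[t] * h[t+1] + bx[t] inside a block, so a token only
    sees later tokens that fall in its own block.
    """
    n = len(bx)
    h = [0.0] * n
    for start in range(0, n, window):
        end = min(start + window, n)
        nxt = 0.0
        for t in range(end - 1, start - 1, -1):
            nxt = a[t] * nxt + bx[t]
            h[t] = nxt
    return h


def locally_bidirectional_scan(a, bx, window=4):
    """Combine the global forward state with the local backward state."""
    hf = forward_scan(a, bx)
    hb = local_backward_scan(a, bx, window)
    # bx[t] is counted by both scans, so one copy is subtracted (assumption).
    return [f + b - x for f, b, x in zip(hf, hb, bx)]


if __name__ == "__main__":
    decay = [0.5] * 8
    impulse = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # single input at token 5
    print(locally_bidirectional_scan(decay, impulse, window=4))
    # prints [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.25]
```

In this toy run, token 4 receives information from token 5 because both sit in the same window, while tokens 0 to 3 stay blind to it. That remaining gap is what the abstract says LBVim closes by alternating scan directions every two layers.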
