LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation

Trung Dinh Quoc Dang, Huy Hoang Nguyen, Aleksei Tiulpin

TL;DR

This paper addresses medical image segmentation (MIS) by modeling both local details and global context with Vision State Space Models. It introduces LoG-VMamba, a module that combines a Local Token Extractor (LTX) for locality with a Global Token Extractor (GTX) for compressed global context, and integrates it into the 2D Swin-UMamba and 3D U-Mamba-Enc architectures. The approach removes the need for complex scanning strategies and delivers state-of-the-art or competitive results on 2D and 3D MIS benchmarks (Endoscopy, Cell, BraTS, ACDC) with improved efficiency. Ablation studies support the effectiveness of the LTX/GTX pair and of the Interleaved global token placement, suggesting the method's potential for broader MIS applications and beyond.
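To make the idea of an interleaved placement concrete, here is a minimal sketch in plain Python. It assumes that "Interleaved" means spreading the global tokens at regular intervals through the scanned sequence of local tokens; the summary above does not spell out the exact layout, so the function `interleave` and its chunking rule are hypothetical illustrations, not the authors' implementation.

```python
def interleave(local_tokens, global_tokens):
    """Spread global tokens evenly through a sequence of local tokens.

    Assumption (not taken from the paper): each global token is placed
    in front of an equally sized chunk of local tokens.
    """
    if not global_tokens:
        return list(local_tokens)
    n_groups = len(global_tokens)
    # Ceiling division so that every local token lands in some chunk.
    size = -(-len(local_tokens) // n_groups)
    out = []
    for g, tok in enumerate(global_tokens):
        out.append(tok)
        out.extend(local_tokens[g * size:(g + 1) * size])
    return out


local = [f"l{i}" for i in range(6)]   # six local tokens in scan order
glob = ["g0", "g1"]                    # two compressed global tokens
print(interleave(local, glob))
# ['g0', 'l0', 'l1', 'l2', 'g1', 'l3', 'l4', 'l5']
```

Under this layout a sequential model meets global context early and repeatedly, rather than only once the scan has finished.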

Abstract

Mamba, a State Space Model (SSM), has recently shown performance competitive with Convolutional Neural Networks (CNNs) and Transformers in Natural Language Processing and general sequence modeling. Various attempts have been made to adapt Mamba to Computer Vision tasks, including medical image segmentation (MIS). Vision Mamba (VM)-based networks are particularly attractive due to their ability to achieve global receptive fields, similar to Vision Transformers, while also maintaining linear complexity in the number of tokens. However, the existing VM models still struggle to maintain both spatially local and global dependencies of tokens in high-dimensional arrays due to their sequential nature. Employing multiple and/or complicated scanning strategies is computationally costly, which hinders applications of SSMs to high-dimensional 2D and 3D images that are common in MIS problems. In this work, we propose Local-Global Vision Mamba, LoG-VMamba, that explicitly enforces spatially adjacent tokens to remain nearby on the channel axis, and retains the global context in a compressed form. Our method allows the SSMs to access the local and global contexts even before reaching the last token while requiring only a simple scanning strategy. Our segmentation models are computationally efficient and substantially outperform both CNN- and Transformer-based baselines on a diverse set of 2D and 3D MIS tasks. The implementation of LoG-VMamba is available at https://github.com/Oulu-IMEDS/LoG-VMamba.
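One way to picture keeping spatially adjacent tokens "nearby on the channel axis" is to concatenate each token's neighbourhood features into its own feature vector before flattening the grid with a single horizontal scan. The sketch below does this in plain Python with zero padding at the borders. It is an illustration under stated assumptions only: the helper names `stack_neighbors` and `horizontal_scan` and the window size `k` are hypothetical, and the actual LTX uses learned depthwise convolutions, which this sketch does not reproduce.

```python
def stack_neighbors(grid, k=3):
    """Concatenate each token's k x k neighbourhood along the channel axis.

    grid: H x W nested list; each entry is a list of C features.
    Returns an H x W grid whose entries hold k * k * C features.
    Out-of-bounds neighbours are zero-padded (an assumption of this sketch).
    """
    h, w = len(grid), len(grid[0])
    c = len(grid[0][0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            feats = []
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        feats.extend(grid[ni][nj])
                    else:
                        feats.extend([0.0] * c)
            row.append(feats)
        out.append(row)
    return out


def horizontal_scan(grid):
    """Flatten an H x W token grid row by row into one sequence."""
    return [tok for row in grid for tok in row]


# 3 x 3 grid with one channel per token, holding the values 0..8.
grid = [[[float(i * 3 + j)] for j in range(3)] for i in range(3)]
tokens = horizontal_scan(stack_neighbors(grid, k=3))
print(tokens[4])  # centre token now carries all nine neighbourhood values
```

After stacking, the token directly below a query no longer has to be reached a full row later in the scan: its features already sit inside the query's own channel vector.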

Paper Structure

This paper contains 14 sections, 3 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: The comparison between how feature extractors establish a correlation between the query token (in red) and a neighboring token (in green). In (c-d) [liu2024vmamba], the distance between the query and its neighbor may be roughly one row (or column) of tokens.
  • Figure 2: Local and global token extractors. DWC indicates a depthwise convolutional layer. $S$ and $K$ correspond to the depthwise and spatial compression in the DWCs of (a) and (b), respectively.
  • Figure 3: LoG-VMamba and its simpler versions compared to the vanilla VSS [liu2024vmamba]. LN and SSM denote layer normalization [ba2016layer] and the S6 block in [gu2023mamba], respectively. Vanilla indicates the module consisting of a DWC layer and SiLU followed by a reshaping operator. The linear block after SSM is only needed in L-VMamba and LoG-VMamba. White blocks indicate modules without learnable parameters. $\bigoplus, \bigotimes$, and © represent element-wise addition, multiplication, and concatenation, respectively. The SSM block in our settings performs only 1 horizontal scan.
  • Figure 4: Qualitative comparisons between our method and the baselines.
  • Figure 5: Computational efficiency and performance comparisons on the 2D datasets. Stars indicate the means while blurry dots represent the individual results.
  • ...and 2 more figures
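
The observation in the Figure 1 caption, that a query and its spatial neighbor can sit roughly one row of tokens apart in the scanned sequence, follows directly from row-major flattening. The short sketch below computes that distance; the grid width of 56 and the helper names are hypothetical values chosen only for illustration.

```python
def raster_index(row, col, width):
    """Position of token (row, col) in a row-major (horizontal) scan."""
    return row * width + col


def scan_distance(a, b, width):
    """Number of scan steps between two tokens given as (row, col) pairs."""
    return abs(raster_index(a[0], a[1], width) - raster_index(b[0], b[1], width))


width = 56            # hypothetical token-grid width
query = (10, 20)
right = (10, 21)      # horizontal neighbour
below = (11, 20)      # vertical neighbour

print(scan_distance(query, right, width))  # 1
print(scan_distance(query, below, width))  # 56
```

The horizontal neighbour is one step away, while the vertical neighbour is a full row (here 56 tokens) away, which is the locality gap that a single horizontal scan leaves open.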