MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation

Chaowei Chen, Li Yu, Shiquan Min, Shunfang Wang

TL;DR

MSVM-UNet addresses the challenge of accurate medical image segmentation by jointly modeling long-range pixel dependencies and multi-scale feature representations in 2D data. It introduces the Multi-Scale Visual State Space (MSVSS) block, which combines 2D Selective-Scan (SS2DBlock) for directional context with a Multi-Scale Feed-Forward Network (MS-FFN) built on multi-scale depthwise convolutions, and pairs it with Large Kernel Patch Expanding (LKPE) for spatially aware upsampling. On Synapse and ACDC, the approach outperforms several state-of-the-art methods, improving Dice similarity and boundary accuracy while keeping computation efficient through linear-complexity components. These results suggest potential for robust, high-resolution medical image segmentation in clinical settings.
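The Dice similarity mentioned above measures the overlap between a predicted mask and the ground truth, as 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch for binary masks stored as nested lists; the function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks (nested lists of 0/1)."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    # For 0/1 masks, the elementwise product counts pixels set in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    if total == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_coefficient(pred, target)  # 2 * 1 / (2 + 1)
```

Multi-organ benchmarks usually report this score per class and then average it; that aggregation step is not shown here.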

Abstract

State Space Models (SSMs), especially Mamba, have shown great promise in medical image segmentation due to their ability to model long-range dependencies with linear computational complexity. However, accurate medical image segmentation requires the effective learning of both multi-scale detailed feature representations and global contextual dependencies. Although existing works have attempted to address this issue by integrating CNNs and SSMs to leverage their respective strengths, they have not designed specialized modules to effectively capture multi-scale feature representations, nor have they adequately addressed the directional sensitivity problem when applying Mamba to 2D image data. To overcome these limitations, we propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet. Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder and better handle 2D visual data. Additionally, the large kernel patch expanding (LKPE) layers achieve more efficient upsampling of feature maps by simultaneously integrating spatial and channel information. Extensive experiments on the Synapse and ACDC datasets demonstrate that our approach is more effective than some state-of-the-art methods in capturing and aggregating multi-scale feature representations and modeling long-range dependencies between pixels.
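To make the idea of multi-scale convolutions concrete, here is a rough sketch of parallel depthwise convolutions at several kernel sizes whose outputs are summed. It is an illustration under stated assumptions rather than the authors' implementation: the kernel sizes (1, 3, 5), the all-ones `box_kernel` weights, and the summation used to merge the branches are all hypothetical choices, and the surrounding layers of the MS-FFN are omitted.

```python
def depthwise_conv2d(x, kernels):
    """Apply one k x k kernel per channel with zero padding (output keeps H x W).

    x: list of channels, each an H x W nested list.
    kernels: one k x k nested list per channel, with k odd.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w, r = len(channel), len(channel[0]), len(kernel) // 2
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding outside
                            acc += channel[ii][jj] * kernel[di + r][dj + r]
                result[i][j] = acc
        out.append(result)
    return out


def multi_scale_depthwise(x, kernels_per_scale):
    """Sum the outputs of depthwise convolutions with different kernel sizes."""
    h, w = len(x[0]), len(x[0][0])
    total = [[[0.0] * w for _ in range(h)] for _ in x]
    for kernels in kernels_per_scale:
        y = depthwise_conv2d(x, kernels)
        for c in range(len(x)):
            for i in range(h):
                for j in range(w):
                    total[c][i][j] += y[c][i][j]
    return total


def box_kernel(k):
    """Hypothetical fixed weights (all ones); a trained model learns these."""
    return [[1.0] * k for _ in range(k)]


# One channel holding a single impulse at the centre of a 3 x 3 map.
x = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]]
scales = [[box_kernel(k)] for k in (1, 3, 5)]  # assumed kernel sizes
out = multi_scale_depthwise(x, scales)
```

In this toy input the 1x1 branch only sees the centre pixel, while the 3x3 and 5x5 branches spread the impulse to every position, including the diagonal neighbours, so each branch contributes context at a different spatial extent.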

Paper Structure (26 sections, 10 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: The overall architecture of our proposed MSVM-UNet. (a) The VMamba V2 encoder backbone network, (b) the decoder network composed of LKPE layers, MSVSS blocks, and FLKPE layers, (c) the Multi-Scale Visual State Space (MSVSS) block, (d) the Multi-Scale Feed-Forward Network (MS-FFN), and (e) the Large Kernel Patch Expanding (LKPE) layer. $f_1^e$, $f_2^e$, $f_3^e$, and $f_4^e$ are the output features of the four stages of the hierarchical encoder backbone. $f_i^d$ and $f_{i + 1}^d$ represent the input and output features of the $i^{th}$ stage of the decoder, respectively.
  • Figure 2: Illustration of the core operations in the MSVSS block. (a) The 2D-Selective-Scan (SS2D) operation. The input is first divided into patches, which are flattened along four scanning paths and each sent to an S6 block. Each output sequence is then restored to 2D according to its scanning path, and the four results are added together to obtain the output. (b) The Multi-Scale Feed-Forward Network (MS-FFN) layer. The input passes through multi-scale convolutions that further aggregate diagonal information and capture multi-scale feature representations.
  • Figure 3: Comparison of different patch expanding layers. (a) The patch expanding layer proposed by Swin-UNet, (b) our proposed large kernel patch expanding (LKPE) layer.
  • Figure 4: Visual comparison of different methods on the Synapse multi-organ dataset. The first column represents the ground truth, and the following columns represent the segmentation predictions of UNet, TransUNet, Swin-UMamba, VM-UNet, and MSVM-UNet methods, respectively. The superiority of our proposed method can be clearly seen in the organ regions highlighted by red rectangles. Various colored contour lines indicate the ground truth for the corresponding organs.
  • Figure 5: Visual comparison of decoder features with and without our proposed blocks. The first and second rows present the features of our decoder (with our proposed blocks) and the original decoder (without our proposed blocks), respectively. Layers are numbered from the bottom to the top of the decoder.
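The four-path scan described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's code: each grid cell stands in for a patch, the four paths are assumed to be row-major and column-major traversals in forward and reverse order, and the hypothetical `cumulative_sum` function is only a placeholder for the S6 block (it is causal, like a scan, but has no learned parameters).

```python
def scan_orders(h, w):
    """Return four traversal orders over an h x w grid (assumed path layout)."""
    row_major = [(i, j) for i in range(h) for j in range(w)]
    col_major = [(i, j) for j in range(w) for i in range(h)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def cumulative_sum(seq):
    """Placeholder for S6: each output depends only on earlier sequence items."""
    out, acc = [], 0.0
    for value in seq:
        acc += value
        out.append(acc)
    return out


def ss2d_sketch(grid, seq_fn=cumulative_sum):
    """Flatten along four paths, process each sequence, restore, and sum."""
    h, w = len(grid), len(grid[0])
    merged = [[0.0] * w for _ in range(h)]
    for order in scan_orders(h, w):
        sequence = [grid[i][j] for i, j in order]  # flatten along this path
        processed = seq_fn(sequence)               # stand-in for the S6 block
        for (i, j), value in zip(order, processed):
            merged[i][j] += value                  # restore position and add
    return merged


grid = [[1, 2], [3, 4]]
merged = ss2d_sketch(grid)
```

Because every path is causal, a single traversal lets a cell see only the cells that precede it in that order; summing the four restored outputs gives each cell context from all cells in its row-wise and column-wise neighbourhoods, while diagonal neighbours remain far apart in every sequence.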