Vivim: a Video Vision Mamba for Medical Video Segmentation

Yijun Yang, Zhaohu Xing, Lequan Yu, Chunwang Huang, Huazhu Fu, Lei Zhu

TL;DR

Vivim tackles the challenge of long-range spatiotemporal modeling in medical video segmentation by integrating structured state space models (Mamba) into a hierarchical Transformer backbone, forming Temporal Mamba Blocks with spatiotemporal selective scans. A boundary-aware affine constraint is added during training to sharpen lesion boundaries. The approach yields state-of-the-art results on thyroid and breast ultrasound video segmentation and colonoscopy polyp videos, while maintaining efficiency superior to transformer-based methods. A new VTUS thyroid ultrasound dataset is introduced to support benchmarking, underscoring the method's practicality and potential clinical impact.

Abstract

Medical video segmentation is gaining attention in clinical practice because video frames provide redundant dynamic references. However, traditional convolutional neural networks have a limited receptive field, and transformer-based networks struggle to model long-term dependencies because of their computational complexity. This bottleneck poses a significant challenge when processing longer sequences in medical video analysis on available devices with limited memory. Recently, state space models (SSMs), popularized by Mamba, have shown impressive results in efficient long-sequence modeling, significantly expanding the receptive field of deep neural networks on many vision tasks. Unfortunately, vanilla SSMs fail to simultaneously capture causal temporal cues and preserve non-causal spatial information. To this end, this paper presents a Video Vision Mamba-based framework, dubbed Vivim, for medical video segmentation tasks. Vivim effectively compresses the long-term spatiotemporal representation into sequences at varying scales with the proposed Temporal Mamba Block. We also introduce an improved boundary-aware affine constraint across frames to enhance the discriminative ability of Vivim on ambiguous lesions. Extensive experiments on thyroid segmentation and breast lesion segmentation in ultrasound videos, and on polyp segmentation in colonoscopy videos, demonstrate the effectiveness and efficiency of Vivim, which outperforms existing methods. The code is available at: https://github.com/scott-yjyang/Vivim. The dataset will be released once accepted.
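
To make the complexity argument in the abstract concrete, the sketch below counts the tokens produced by a short clip and compares the pairwise interactions of full self-attention (quadratic in sequence length) with a single linear pass such as an SSM scan. The clip size, patch size, and function names are hypothetical values chosen for illustration, not settings reported by the authors, and constant factors are ignored.

```python
def token_count(frames, height, width, patch):
    """Number of patch tokens when every frame is split into patch x patch tiles."""
    return frames * (height // patch) * (width // patch)


def attention_interactions(n):
    """Full self-attention compares every token with every token: n * n."""
    return n * n


def linear_scan_steps(n):
    """A recurrent scan visits each token once: n steps."""
    return n


# Hypothetical clip: 8 frames of 256 x 256 pixels, 4 x 4 patches (assumed values).
n = token_count(frames=8, height=256, width=256, patch=4)
print(n)                             # 32768 tokens
print(attention_interactions(n))     # 1073741824 pairwise interactions
print(linear_scan_steps(n))          # 32768 steps
```

Under these assumed sizes the quadratic term is larger than the linear one by a factor equal to the sequence length, which is why joint attention over all frames becomes costly as clips get longer.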

Paper Structure

This paper contains 26 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Several cases from our collected VTUS dataset. All videos are from patients with thyroid nodules and were acquired by ultrasound doctors with more than 10 years of clinical experience to ensure image quality. The videos are cross-annotated by three experts with over three years of experience in thyroid diagnosis.
  • Figure 2: (a) Overview of the proposed Vivim for medical video segmentation. The video sequence is first fed into patch embedding and multi-scale Temporal Mamba Blocks for encoding. The feature sequences are then aggregated by a CNN-based segmentation head to predict the segmentation results. (b) The fundamental building block of Vivim, the Temporal Mamba Block. Efficient Spatial Self-attention performs initial spatial modeling, while ST-Mamba explores spatiotemporal dependencies with linear complexity. (c) ST-Mamba incorporates a multi-way spatiotemporal selective scan for long-sequence modeling in video vision tasks.
  • Figure 3: Illustration of the proposed spatiotemporal selective scan, which comprises a temporal forward scan, a temporal backward scan, and a spatial scan.
  • Figure 4: Overview of the training strategy. The proposed patch-level boundary-aware affine constraint $\mathcal{L}_{affine}$ is used to optimize Vivim jointly with the segmentation loss $\mathcal{L}_{seg}$ and the boundary cross-entropy loss $\mathcal{L}_{bce}$. The pre-trained MLP that computes the affine transformation is frozen during training.
  • Figure 5: Visual comparison on video ultrasound thyroid segmentation with several competitive image- and video-based methods. Consecutive results of one case are displayed.
  • ...and 2 more figures
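
The three scan directions named in the Figure 3 caption can be sketched as permutations of the token indices of a (T, H, W) feature grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions do not specify the exact traversal order, so the sketch assumes that the temporal forward scan reads tokens frame by frame, the temporal backward scan is its reverse, and the spatial scan visits each spatial position in turn and walks through all frames at that position. The function names `scan_orders` and `invert` are hypothetical.

```python
def scan_orders(T, H, W):
    """Return three token orderings for a grid of T frames with H x W patches.

    Assumed layout: token (t, h, w) has flat index (t * H + h) * W + w.
    """
    def flat(t, h, w):
        return (t * H + h) * W + w

    # Temporal forward: frame by frame, row by row (assumed order).
    forward = [flat(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    # Temporal backward: the same path traversed in reverse.
    backward = forward[::-1]
    # Spatial: for each spatial position, visit every frame (assumed order).
    spatial = [flat(t, h, w) for h in range(H) for w in range(W) for t in range(T)]
    return forward, backward, spatial


def invert(order):
    """Inverse permutation, used to put scanned outputs back in grid order."""
    inverse = [0] * len(order)
    for position, index in enumerate(order):
        inverse[index] = position
    return inverse


# Tiny example: 2 frames, each with a 1 x 2 patch grid.
forward, backward, spatial = scan_orders(T=2, H=1, W=2)
print(forward)   # [0, 1, 2, 3]
print(backward)  # [3, 2, 1, 0]
print(spatial)   # [0, 2, 1, 3]

tokens = ["f0p0", "f0p1", "f1p0", "f1p1"]
scanned = [tokens[i] for i in spatial]            # sequence fed to a 1D model
inverse = invert(spatial)
restored = [scanned[inverse[i]] for i in range(len(tokens))]
print(restored == tokens)  # True
```

In a multi-way design, each ordering would feed its own sequence model and the outputs would be restored to grid order before being merged; how the merge is performed is not stated in the captions above.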