Mamba-3D as Masked Autoencoders for Accurate and Data-Efficient Analysis of Medical Ultrasound Videos

Jiaheng Zhou, Yanfeng Zhou, Wei Fang, Yuxing Tang, Le Lu, Ge Yang

TL;DR

This work tackles data scarcity in medical ultrasound video analysis by introducing E-ViM³, a Mamba-3D network that preserves 3D structure to improve space-time modeling. It combines Enclosure Global Tokens (EGT) for robust global feature aggregation with Spatial-Temporal Chained (STC) masking for data-efficient self-supervised pre-training, forming a masked autoencoder framework tailored for 3D video data. Across EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS, E-ViM³ achieves state-of-the-art or competitive performance on EF prediction and breast cancer classification, and maintains strong results with limited labeling, highlighting practical clinical impact. The approach offers a scalable path to 3D ultrasound analysis and can be extended to other 3D or high-dimensional visual data tasks, due to its data-efficient pre-training and efficient 3D token handling.

Abstract

Ultrasound videos are an important form of clinical imaging data, and deep learning-based automated analysis can improve diagnostic accuracy and clinical efficiency. However, the scarcity of labeled data and the inherent challenges of video analysis have impeded the advancement of related methods. In this work, we introduce E-ViM$^3$, a data-efficient Vision Mamba network that preserves the 3D structure of video data, enhancing long-range dependencies and inductive biases to better model space-time correlations. With our design of Enclosure Global Tokens (EGT), the model captures and aggregates global features more effectively than competing methods. To further improve data efficiency, we employ masked video modeling for self-supervised pre-training, with the proposed Spatial-Temporal Chained (STC) masking strategy designed to adapt to various video scenarios. Experiments demonstrate that E-ViM$^3$ achieves state-of-the-art performance in two high-level semantic analysis tasks across four datasets of varying sizes: EchoNet-Dynamic, CAMUS, MICCAI-BUV, and WHBUS. Furthermore, our model achieves competitive performance with limited labels, highlighting its potential impact on real-world clinical applications.

Paper Structure

This paper contains 46 sections, 9 equations, 10 figures, 14 tables, 1 algorithm.

Figures (10)

  • Figure 1: Mamba-3D as masked autoencoders. By preserving the structure of video data, Mamba-3D provides a stronger inductive bias for masked video modeling than ViT [Dosovitskiy1, Tong1] or vanilla Mamba for visual data [Zhu1, Liu3, Li1], both of which operate on flattened 1D sequences and rely heavily on positional encodings. This enables effective self-supervised learning with limited data.
  • Figure 2: The pipeline of self-supervised pre-training and fine-tuning of the proposed E-ViM³ model. The upper section represents the pre-training phase; the lower section represents the fine-tuning phase. Initially, the video is embedded as 3D patches but not directly flattened into a 1D sequence. During pre-training, the proposed Spatial-Temporal Chained masking is applied, and masked tokens are removed for efficiency. The 3D-structured tokens, along with the proposed Enclosure Global Tokens, are fed into the Mamba-3D encoder. The pre-training task is to restore the masked patches using a decoder, while downstream tasks leverage the features extracted from the global tokens. Further details can be found in the Methods section. A complete architecture diagram is available in the supplementary material.
  • Figure 3: An illustration of adding Enclosure Global Tokens to the original video embedding. The optional inner planes are also taken into account. In this case, we use $\mathrm{L_g=H_g=W_g=3}$ as an example.
  • Figure 4: The main data flow: from the input video to the encoder's output without token masking. Please note the changes in data dimensions and their order. Global tokens are omitted on the right for simplicity. Diagrams with more details are available in the supplementary material.
  • Figure 5: Different masking strategies, including the proposed Spatial-Temporal Chained (STC) masking. Here, we use $\mathrm{\gamma_t=\gamma_s=2}$ as an example for STC. The other three widely used strategies are special cases of STC with specific hyper-parameters: $\mathrm{\gamma_t=\gamma_s=1}$ for Random (Agnostic), $\mathrm{\gamma_t=1, \gamma_s=H=W}$ for Frame, and $\mathrm{\gamma_t=L, \gamma_s=1}$ for Tube.
  • ...and 5 more figures
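
The special cases listed in the Figure 5 caption suggest one way to read STC masking: tokens on the L × H × W grid are grouped into blocks of size γ_t × γ_s × γ_s, and each block is masked or kept as a whole. The sketch below follows that reading only. It is an assumption drawn from the caption, not the authors' implementation, and the function name `stc_mask`, its parameters, and the uniform block sampling are all hypothetical.

```python
import random


def stc_mask(L, H, W, gamma_t, gamma_s, mask_ratio, seed=None):
    """Return an L x H x W nested list of booleans; True marks a masked token.

    Assumption (inferred from the Figure 5 caption, not from the paper's code):
    tokens are grouped into blocks of gamma_t x gamma_s x gamma_s, and every
    token in a block shares the same masked/visible state.
    """
    if L % gamma_t or H % gamma_s or W % gamma_s:
        raise ValueError("grid dimensions must be divisible by the block sizes")

    rng = random.Random(seed)
    nt, nh, nw = L // gamma_t, H // gamma_s, W // gamma_s  # blocks per axis
    n_blocks = nt * nh * nw
    n_masked = round(mask_ratio * n_blocks)

    # Sample which blocks are masked, uniformly without replacement.
    masked_blocks = set(rng.sample(range(n_blocks), n_masked))

    mask = [[[False] * W for _ in range(H)] for _ in range(L)]
    for t in range(L):
        for h in range(H):
            for w in range(W):
                # Flat index of the block that contains token (t, h, w).
                block = ((t // gamma_t) * nh + h // gamma_s) * nw + w // gamma_s
                mask[t][h][w] = block in masked_blocks
    return mask


# Special cases named in the caption, on a hypothetical 4 x 4 x 4 token grid:
random_mask = stc_mask(4, 4, 4, gamma_t=1, gamma_s=1, mask_ratio=0.5, seed=0)
frame_mask = stc_mask(4, 4, 4, gamma_t=1, gamma_s=4, mask_ratio=0.5, seed=0)
tube_mask = stc_mask(4, 4, 4, gamma_t=4, gamma_s=1, mask_ratio=0.5, seed=0)
```

Under this reading, γ_t controls how far a masked region extends in time and γ_s how far it extends in space, so intermediate values such as γ_t = γ_s = 2 sit between the random, frame, and tube extremes.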