MambaMIM: Pre-training Mamba with State Space Token Interpolation and its Application to Medical Image Segmentation

Fenghe Tang, Bingkun Nian, Yingtai Li, Zihang Jiang, Jie Yang, Wei Liu, S. Kevin Zhou

TL;DR

MambaMIM is a generative self-supervised pre-training framework tailored to Mamba-based models in 3D medical imaging. It introduces TOKI, a token-interpolation strategy for state space sequences, and a bottom-up hierarchical masking scheme; together they preserve causal relationships within the masked sequence and keep masking consistent across single and hybrid Mamba architectures. Pre-training on 6.8K CT scans and fine-tuning on eight public segmentation benchmarks yields substantial gains, including state-of-the-art performance on several tasks and strong generalization to unseen datasets and MRI. The approach advances efficient long-range modeling in medical image segmentation and demonstrates the practical value of state-space-aware pre-training for Mamba architectures.

Abstract

Recently, the state space model Mamba has demonstrated efficient long-sequence modeling capabilities, particularly for addressing long-sequence visual tasks in 3D medical imaging. However, existing generative self-supervised learning methods have not yet fully unleashed Mamba's potential for handling long-range dependencies because they overlook the inherent causal properties of state space sequences in masked modeling. To address this challenge, we propose a general-purpose pre-training framework called MambaMIM, a masked image modeling method based on a novel TOKen-Interpolation strategy (TOKI) for the selective structure state space sequence, which learns causal relationships of state space within the masked sequence. Further, MambaMIM introduces a bottom-up 3D hybrid masking strategy to maintain masking consistency across different architectures, and it can be used on any single or hybrid Mamba architecture to enhance its multi-scale and long-range representation capability. We pre-train MambaMIM on a large-scale dataset of 6.8K CT scans and evaluate its performance across eight public medical segmentation benchmarks. Extensive downstream experiments demonstrate the feasibility and benefits of using Mamba for medical image pre-training. In particular, when we apply MambaMIM to a customized architecture that hybridizes MedNeXt and Vision Mamba, we consistently obtain state-of-the-art segmentation performance. The code is available at: https://github.com/FengheTan9/MambaMIM.

Paper Structure

This paper contains 17 sections, 8 equations, 11 figures, 10 tables, 3 algorithms.

Figures (11)

  • Figure 1: Different token generation strategies for Mamba-based networks. (a) A random learnable mask token is generated for decoding. (b) Token-interpolation (TOKI) exploits the structured sequence relationships within the state space for Mamba.
  • Figure 2: Different mask token strategies for Vanilla Mamba and Hybrid Mamba pre-trained on the BTCV dataset for the 3D segmentation task. TOKI yields larger improvements than previous SSL methods that use a learnable mask token, and both clearly outperform training without pre-training.
  • Figure 3: The illustration of the masking inconsistency problem. (a) Directly dropping in Mamba. (b) Sparsely dropping in CNN. (c) Masking inconsistency in hybrid architecture. (d) Masking consistency in hybrid architecture. Mask inconsistency in hybrid architectures refers to the cross-architecture inconsistency in mask position that induces pixel intensity distributional shifts during encoding, which degrades representation learning.
  • Figure 4: The overall structure of MambaMIM. The hybrid encoder performs bottom-up masked modeling: initially unmasked patches are white, unmasked patches obtained by bottom-up mapping are purple, and masked positions are empty. The CNN stage (yellow) uses a 3D sparse operator for hierarchical encoding and fills learnable mask tokens into masked positions for decoding. The Mamba stage (purple) learns only from unmasked sequences, and TOKI is applied during decoding to preserve the continuity of the 1D selective structure state space sequence.
  • Figure 5: Visualization results on BTCV dataset.
  • ...and 6 more figures
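
The bottom-up masking mentioned in the abstract and in the Figure 4 caption can be illustrated with a small sketch. This page does not spell out the procedure, so the following is only one plausible reading, not the authors' implementation: a random mask is drawn on the coarsest 3D grid and then mapped to each finer stage by repeating every cell, so that a position masked at one scale is masked at all scales. The function name `bottom_up_masks`, the 2x scale factor between stages, the grid size, and the mask ratio are all assumptions made for illustration.

```python
import numpy as np


def bottom_up_masks(coarse_shape, mask_ratio, num_stages, seed=0):
    """Hypothetical helper: build scale-consistent 3D visibility masks.

    A random pattern is drawn on the coarsest grid, then mapped to each
    finer stage by repeating every cell twice along each axis (assumed
    2x downsampling between stages).

    Returns a list of boolean arrays (True = visible), coarsest first.
    """
    rng = np.random.default_rng(seed)
    n_cells = int(np.prod(coarse_shape))
    n_masked = int(round(n_cells * mask_ratio))

    # Start with everything visible, then hide a random subset of cells.
    flat = np.ones(n_cells, dtype=bool)
    flat[rng.choice(n_cells, size=n_masked, replace=False)] = False

    masks = [flat.reshape(coarse_shape)]
    for _ in range(num_stages - 1):
        finer = masks[-1]
        for axis in range(3):
            finer = np.repeat(finer, 2, axis=axis)
        masks.append(finer)
    return masks


# Illustrative values only; the grid size and ratio are not from the paper.
masks = bottom_up_masks(coarse_shape=(6, 6, 6), mask_ratio=0.6, num_stages=4)
for m in masks:
    print(m.shape, float(m.mean()))
```

Because every finer mask is an exact upsampling of the coarsest one, the visible fraction is identical at every stage, which is the kind of cross-stage consistency that panel (d) of Figure 3 contrasts with the inconsistent case in panel (c).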