Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation
Chao Ma, Ziyang Wang
TL;DR
This work tackles the scarcity of annotations in medical image segmentation by introducing Semi-Mamba-UNet, which combines a visual Mamba-based UNet with a CNN-based UNet in a semi-supervised learning (SSL) framework. The method applies pixel-level cross-supervision between the two backbones together with a pixel-level contrastive learning component, and its loss is decomposed into $\mathcal{L}_{\rm sup}$, $\mathcal{L}_{\rm semi}$, and $\mathcal{L}_{\rm contra}$ so that both labeled and unlabeled data contribute to training. The cross-architecture SSL strategy lets each network generate pseudo-labels that train the other, while pixel-level projections strengthen feature learning on unlabeled samples. Evaluations on cardiac MRI (ACDC) and prostate MRI (PROMISE12) datasets show better performance than seven SSL baselines, and the code is released as open source for reproducibility.
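The three loss terms can be read as a single training objective. As a minimal sketch, assuming they are combined as a weighted sum (the weights $\lambda_{\rm semi}$ and $\lambda_{\rm contra}$ are hypothetical placeholders introduced here for illustration, not values stated in this summary):

$$
\mathcal{L}_{\rm total} = \mathcal{L}_{\rm sup} + \lambda_{\rm semi}\,\mathcal{L}_{\rm semi} + \lambda_{\rm contra}\,\mathcal{L}_{\rm contra}
$$

Under this reading, $\mathcal{L}_{\rm sup}$ is driven by the labeled images, $\mathcal{L}_{\rm semi}$ by the cross-supervision between the two networks, and $\mathcal{L}_{\rm contra}$ by the pixel-level contrastive component.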
Abstract
Medical image segmentation is essential in diagnostics, treatment planning, and healthcare, with deep learning offering promising advancements. Notably, the convolutional neural network (CNN) excels in capturing local image features, whereas the Vision Transformer (ViT) adeptly models long-range dependencies through multi-head self-attention mechanisms. Despite their strengths, both the CNN and ViT face challenges in efficiently processing long-range dependencies in medical images, often requiring substantial computational resources. This issue, combined with the high cost and limited availability of expert annotations, poses significant obstacles to achieving precise segmentation. To address these challenges, this study introduces Semi-Mamba-UNet, which integrates a purely visual Mamba-based U-shaped encoder-decoder architecture with a conventional CNN-based UNet into a semi-supervised learning (SSL) framework. Drawing inspiration from consistency regularisation techniques, this SSL approach uses both networks to generate pseudo-labels and to cross-supervise each other at the pixel level. Furthermore, we introduce a self-supervised pixel-level contrastive learning strategy that employs a pair of projectors to further enhance feature learning, especially on unlabelled data. Semi-Mamba-UNet was comprehensively evaluated on two publicly available segmentation datasets and compared with seven other SSL frameworks that use either a CNN- or ViT-based UNet as the backbone network, highlighting the superior performance of the proposed method. The source code of Semi-Mamba-UNet, all baseline SSL frameworks, the CNN- and ViT-based networks, and the two corresponding datasets are made publicly accessible.
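The pixel-level cross-supervision described above can be illustrated with a small, framework-free sketch. It assumes hard pseudo-labels (per-pixel argmax) and a symmetric cross-entropy between each network's logits and the other network's pseudo-labels; the function names and the flattened list-of-pixels layout are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def log_softmax_at(logits, idx):
    """Log-probability of class `idx` under a softmax over `logits` (stable log-sum-exp)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[idx] - lse


def pseudo_labels(logit_map):
    """Hard pseudo-label per pixel: the index of the largest class logit."""
    return [max(range(len(px)), key=px.__getitem__) for px in logit_map]


def cross_entropy(logit_map, labels):
    """Mean per-pixel cross-entropy between class logits and integer labels."""
    losses = [-log_softmax_at(px, y) for px, y in zip(logit_map, labels)]
    return sum(losses) / len(losses)


def cross_supervision_loss(logits_a, logits_b):
    """Each network is trained on the other's pseudo-labels.

    `logits_a` and `logits_b` are flattened images: one list of class logits
    per pixel, produced by two different networks for the same unlabelled input.
    In a real training loop the pseudo-labels would be treated as constants
    (no gradient flows through the argmax).
    """
    labels_a = pseudo_labels(logits_a)
    labels_b = pseudo_labels(logits_b)
    return cross_entropy(logits_a, labels_b) + cross_entropy(logits_b, labels_a)


if __name__ == "__main__":
    # Two pixels, two classes; the networks agree on pixel 0 and disagree on pixel 1.
    net_a = [[2.0, 0.0], [0.0, 2.0]]
    net_b = [[1.5, 0.5], [1.0, 0.0]]
    print(cross_supervision_loss(net_a, net_b))
```

The loss is small when the two networks predict the same class for each pixel and grows when they disagree, which is what pushes the two architectures towards consistent predictions on unlabelled images.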
