EM-Net: Efficient Channel and Frequency Learning with Mamba for 3D Medical Image Segmentation
Ao Chang, Jiajun Zeng, Ruobing Huang, Dong Ni
TL;DR
EM-Net tackles 3D medical image segmentation by combining a Mamba-based channel-attention mechanism with an efficient frequency-domain learning (EFL) layer that fuses global and local features at reduced computational cost. The encoder uses CSRM blocks for channel-selective attention and CSRM-F blocks, which contain an EFL layer, to balance global and local cues, complemented by a four-stage Mamba-infused decoder. On the Synapse and BTCV CT datasets, EM-Net achieves higher Dice scores than SOTA methods with roughly half the parameters and about 2x faster training, demonstrating robust segmentation across organs. The result is a scalable, memory-efficient framework for high-resolution 3D segmentation with strong cross-scale feature integration.
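Since the comparison above is stated in Dice scores, a minimal sketch of how a per-organ Dice score can be computed on 3D label volumes may help. It assumes integer label maps with 0 as background. The function name `dice_per_class` and the toy volumes are illustrative and are not taken from the paper or its evaluation code.

```python
import numpy as np

def dice_per_class(pred, target, num_classes):
    """Dice score for each foreground label (1 .. num_classes - 1).

    pred, target: integer label volumes of identical shape, 0 = background.
    Returns a dict mapping label -> Dice; NaN if the label is absent from both.
    """
    scores = {}
    for c in range(1, num_classes):
        p = pred == c
        t = target == c
        denom = p.sum() + t.sum()
        if denom == 0:
            scores[c] = float("nan")  # organ absent in prediction and reference
            continue
        # Dice = 2 |P ∩ T| / (|P| + |T|)
        scores[c] = 2.0 * np.logical_and(p, t).sum() / denom
    return scores

# Toy 4x4x4 volumes (hypothetical data): the reference organ fills 32 voxels,
# the prediction fills 48 voxels and covers the whole reference region.
target = np.zeros((4, 4, 4), dtype=np.int64)
pred = np.zeros((4, 4, 4), dtype=np.int64)
target[:2] = 1
pred[:3] = 1

dice = dice_per_class(pred, target, num_classes=2)
print(dice[1])  # 2 * 32 / (48 + 32) = 0.8
```

Multi-organ benchmarks commonly report the mean of such per-organ scores; the exact averaging protocol used for Synapse and BTCV here is not specified in this summary.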
Abstract
Convolutional neural networks have dominated 3D medical image segmentation but can be limited by their small receptive fields. Transformer models excel at capturing global relationships through self-attention but incur high computational costs at high resolutions. Recently, Mamba, a state space model, has emerged as an effective approach for sequential modeling. Inspired by its success, we introduce a novel Mamba-based 3D medical image segmentation model called EM-Net. It efficiently captures attentive interactions between regions by integrating and selecting channels, and it uses the frequency domain to harmonize feature learning across scales while accelerating training. Comprehensive experiments on two challenging multi-organ datasets, comparing against other state-of-the-art (SOTA) algorithms, show that our method achieves better segmentation accuracy while requiring nearly half the parameters of SOTA models and training 2x faster.
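To illustrate the general idea behind learning in the frequency domain, the sketch below applies a per-channel filter to a 3D feature map by elementwise multiplication of its Fourier spectrum. This is a generic construction and not the paper's EFL layer, whose details are not given here. The function name `frequency_filter_3d`, the tensor shapes, and the fixed weights (which would be learnable parameters in a network) are all assumptions for illustration.

```python
import numpy as np

def frequency_filter_3d(x, weight):
    """Filter a feature map by pointwise multiplication in the frequency domain.

    x:      real array of shape (C, D, H, W)
    weight: complex array of shape (C, D, H, W // 2 + 1), one value per
            channel and frequency bin (assumed learnable in a real model)
    """
    spec = np.fft.rfftn(x, axes=(1, 2, 3))   # real FFT over the spatial axes
    spec = spec * weight                      # one multiply per frequency bin
    # Inverse transform back to the original spatial size
    return np.fft.irfftn(spec, s=x.shape[1:], axes=(1, 2, 3))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 6, 8))        # hypothetical 2-channel feature map

# An all-ones filter leaves the features unchanged.
identity = np.ones((2, 4, 6, 5), dtype=np.complex128)
y = frequency_filter_3d(x, identity)

# Keeping only the zero-frequency bin replaces each channel by its global mean,
# showing that a single frequency weight affects every voxel at once.
dc_only = np.zeros((2, 4, 6, 5), dtype=np.complex128)
dc_only[:, 0, 0, 0] = 1.0
y_dc = frequency_filter_3d(x, dc_only)
```

By the convolution theorem, this pointwise product is equivalent to a circular convolution whose kernel spans the whole volume, which is why frequency-domain filters can mix global context at FFT cost rather than at the cost of a large spatial kernel.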
