MSLAU-Net: A Hybrid CNN-Transformer Network for Medical Image Segmentation
Libin Lan, Yanxin Li, Xiaojuan Liu, Juan Zhou, Jianxun Zhang, Nannan Huang, Yudong Zhang
TL;DR
MSLAU-Net addresses the trade-off between local detail and global context in medical image segmentation by integrating a Multi-Scale Linear Attention (MSLA) module with a CNN-Transformer encoder and a top-down decoder. The method reports state-of-the-art results on Synapse, ACDC, and CVC-ClinicDB, and its ablations support the four-stage encoder design and the multi-scale attention strategy. The key contributions are the MSLA module, which fuses depth-wise multi-scale features with linear attention, and a lightweight, asymmetric decoder for multi-level feature aggregation. The results indicate strong generalization and robustness, with efficiency presented as practical for clinical deployment, and the code is open source for reproducibility.
Abstract
Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at https://github.com/Monsoon49/MSLAU-Net.
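To illustrate why linear attention can model long-range dependencies at low cost, the following is a minimal sketch of generic kernelized linear attention in NumPy. It is not the paper's MSLA module: the `elu(x) + 1` feature map, the function names, and the tensor shapes are assumptions chosen for illustration, and the multi-scale depth-wise branch is omitted. The sketch only shows that reordering the matrix products avoids forming the N x N attention matrix while giving the same output as the quadratic form.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map elu(x) + 1 (a common choice for
    # linear attention; the paper may use a different one).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention in O(N * d * d_v) time.

    q, k: arrays of shape (N, d); v: array of shape (N, d_v).
    """
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                  # (d, d_v): keys and values summarized once
    z = q @ k.sum(axis=0)         # (N,): per-query normalizer
    return (q @ kv) / (z[:, None] + eps)


def quadratic_reference(q, k, v, eps=1e-6):
    """Same computation with the explicit (N, N) score matrix, O(N^2 * d)."""
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T              # (N, N)
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, dim = 196, 32       # hypothetical sizes, e.g. a 14 x 14 feature map
    q = rng.standard_normal((n_tokens, dim))
    k = rng.standard_normal((n_tokens, dim))
    v = rng.standard_normal((n_tokens, dim))
    out = linear_attention(q, k, v)
    print(out.shape, np.allclose(out, quadratic_reference(q, k, v)))
```

Because the key-value summary `kv` has a size independent of the number of tokens, the cost grows linearly with N, which is what makes this family of attention attractive for high-resolution medical images.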
