MSLAU-Net: A Hybrid CNN-Transformer Network for Medical Image Segmentation

Libin Lan, Yanxin Li, Xiaojuan Liu, Juan Zhou, Jianxun Zhang, Nannan Huang, Yudong Zhang

TL;DR

MSLAU-Net addresses the challenge of balancing local detail and global context in medical image segmentation by combining a hybrid CNN-Transformer encoder, built around a Multi-Scale Linear Attention (MSLA) module, with a lightweight top-down decoder. The MSLA module fuses depth-wise multi-scale convolutional features with linear attention, and the asymmetric decoder aggregates multi-level features while restoring spatial resolution. On Synapse, ACDC, and CVC-ClinicDB, the method outperforms other state-of-the-art methods on nearly all evaluation metrics, and ablations support the four-stage encoder design and the multi-scale attention strategy. The results indicate good generalization and robustness across three imaging modalities at a computational cost presented as practical for clinical deployment, and the code is open source for reproducibility.

Abstract

Both CNN-based and Transformer-based methods have achieved remarkable success in medical image segmentation tasks. However, CNN-based methods struggle to effectively capture global contextual information due to the inherent limitations of convolution operations. Meanwhile, Transformer-based methods suffer from insufficient local feature modeling and face challenges related to the high computational complexity caused by the self-attention mechanism. To address these limitations, we propose a novel hybrid CNN-Transformer architecture, named MSLAU-Net, which integrates the strengths of both paradigms. The proposed MSLAU-Net incorporates two key ideas. First, it introduces Multi-Scale Linear Attention, designed to efficiently extract multi-scale features from medical images while modeling long-range dependencies with low computational complexity. Second, it adopts a top-down feature aggregation mechanism, which performs multi-level feature aggregation and restores spatial resolution using a lightweight structure. Extensive experiments conducted on benchmark datasets covering three imaging modalities demonstrate that the proposed MSLAU-Net outperforms other state-of-the-art methods on nearly all evaluation metrics, validating the superiority, effectiveness, and robustness of our approach. Our code is available at https://github.com/Monsoon49/MSLAU-Net.
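
The low-complexity claim can be made concrete with a short worked comparison. This is only a sketch, assuming the Efficient Attention formulation of linear attention that the Figure 1 caption names. The symbols $N$ (number of spatial positions), $d_k$ and $d_v$ (key and value channel widths), and $\rho_q$, $\rho_k$ (normalization functions) are notation introduced here for illustration and may differ from the paper's own.

$$
\begin{aligned}
\text{Softmax attention:}\quad & \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, && \text{cost } O\!\left(N^{2}(d_k + d_v)\right),\\
\text{Efficient Attention:}\quad & \rho_q(Q)\,\bigl(\rho_k(K)^{\top}V\bigr), && \text{cost } O\!\left(N\,d_k\,d_v\right),
\end{aligned}
$$

where $Q, K \in \mathbb{R}^{N \times d_k}$, $V \in \mathbb{R}^{N \times d_v}$, $\rho_q$ applies a softmax over the channels of each query, and $\rho_k$ applies a softmax over the positions of each key channel. As an illustrative case (the sizes are assumed, not taken from the paper), a 56$\times$56 feature map gives $N = 3136$, so the $N \times N$ attention map has $3136^2 = 9{,}834{,}496$ entries, whereas with $d_k = d_v = 64$ the intermediate $\rho_k(K)^{\top}V$ has only $64 \times 64 = 4096$ entries.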

Paper Structure

This paper contains 26 sections, 17 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Details of Multi-Scale Linear Attention. The MSLA module uses a parallel design to take full advantage of CNNs for capturing multi-scale features and of linear attention for modeling long-range dependencies. The input feature map is first divided into four parts along the channel dimension. Each part is then processed by a depth-wise convolution with a different kernel size (3$\times$3, 5$\times$5, 7$\times$7, and 9$\times$9) to extract multi-scale features. Next, linear attention, i.e., Efficient Attention, is applied to the multi-scale features to model long-range dependencies. Finally, the resulting outputs are fused using a 1$\times$1 convolution.
  • Figure 2: Details of the LFE block. An LFE block consists of three key modules: a 3$\times$3 depth-wise convolution, three consecutive convolutional layers, and an FFN.
  • Figure 3: Details of the GFE Block. A GFE block comprises three main components: a 3$\times$3 depth-wise convolution, an MSLA module, and an FFN.
  • Figure 4: The proposed MSLAU-Net adopts an encoder-decoder structure. The encoder integrates CNN and transformer components, utilizing LFE and GFE blocks for local and global feature extraction, respectively. The decoder employs a top-down aggregation mechanism to aggregate multi-level features from the corresponding stages of the encoder. These features are then upsampled to the original image resolution, producing the final mask prediction.
  • Figure 5: Qualitative results of different methods on the Synapse multi-organ segmentation dataset. Our MSLAU-Net captures organ boundaries more accurately and demonstrates superior detail-handling capabilities. Best viewed in color with zoom-in.
  • ...and 2 more figures
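
The data flow described in the Figure 1 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the random weights, the use of identity query/key/value projections, and the choice to apply attention separately to each branch before concatenation are all assumptions made for illustration, since the caption does not specify them.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """Depth-wise 'same' convolution. x: (C, H, W), kernels: (C, k, k)."""
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            # Each channel is filtered only by its own kernel.
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out


def softmax(x, axis):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)


def efficient_attention(q, k, v):
    """Efficient Attention. q, k: (N, d_k), v: (N, d_v)."""
    q = softmax(q, axis=1)  # normalize each query over its channels
    k = softmax(k, axis=0)  # normalize each key channel over positions
    context = k.T @ v       # (d_k, d_v) summary, cost O(N * d_k * d_v)
    return q @ context      # (N, d_v); no N x N map is ever formed


def msla_sketch(x, rng):
    """Hypothetical MSLA-style block on a (C, H, W) map, C divisible by 4."""
    c, h, w = x.shape
    outputs = []
    # Split channels into four groups, one per kernel size.
    for part, ksize in zip(np.split(x, 4, axis=0), (3, 5, 7, 9)):
        cp = part.shape[0]
        # Random stand-in for learned depth-wise weights (assumption).
        kern = rng.standard_normal((cp, ksize, ksize)) / ksize**2
        feat = depthwise_conv2d(part, kern)
        tokens = feat.reshape(cp, h * w).T  # (N, cp)
        # Identity Q/K/V projections (assumption, for brevity).
        attended = efficient_attention(tokens, tokens, tokens)
        outputs.append(attended.T.reshape(cp, h, w))
    y = np.concatenate(outputs, axis=0)
    # A 1x1 convolution is a per-pixel linear map across channels.
    w_fuse = rng.standard_normal((c, c)) / np.sqrt(c)
    return np.einsum("oc,chw->ohw", w_fuse, y)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
y = msla_sketch(x, rng)
print(y.shape)  # (8, 16, 16)
```

Because the key-value summary `context` has a size independent of the number of positions, the attention cost grows linearly with the spatial size of the feature map, which is the property the caption relies on when pairing linear attention with the multi-scale convolution branches.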