Optimizing Medical Image Segmentation with Advanced Decoder Design

Weibin Yang, Zhiqi Dong, Mingyuan Xu, Longwei Xu, Dehua Geng, Yusong Li, Pengwei Wang

TL;DR

Swin DER is proposed. It performs upsampling with a learnable interpolation algorithm called offset coordinate neighborhood weighted upsampling (Onsampling), replaces the traditional skip connection with a spatial-channel parallel attention gate (SCP AG), and introduces deformable convolution together with an attention mechanism in the feature extraction module of the decoder.

Abstract

U-Net is widely used in medical image segmentation due to its simple and flexible architecture. To address the challenges of scale and complexity in medical tasks, several variants of U-Net have been proposed. In particular, methods based on the Vision Transformer (ViT), represented by Swin UNETR, have gained widespread attention in recent years. However, these improvements often focus on the encoder and overlook the crucial role of the decoder in refining segmentation details. This design imbalance limits further gains in segmentation performance. To address this issue, we analyze the roles of the decoder components, namely the upsampling method, the skip connection, and the feature extraction module, as well as the shortcomings of existing methods. We then propose Swin DER (i.e., Swin UNETR Decoder Enhanced and Refined), which specifically optimizes the design of these three components. Swin DER performs upsampling with a learnable interpolation algorithm called offset coordinate neighborhood weighted upsampling (Onsampling) and replaces the traditional skip connection with a spatial-channel parallel attention gate (SCP AG). Additionally, Swin DER introduces deformable convolution together with an attention mechanism in the feature extraction module of the decoder. Our model surpasses other state-of-the-art methods on both the Synapse and the MSD brain tumor segmentation tasks. Code is available at: https://github.com/WillBeanYang/Swin-DER

Paper Structure

This paper contains 25 sections, 18 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: The overall architecture of Swin DER. Our design focuses on the decoder side: the feature maps output by the second set of encoder blocks (Residual Block) are reweighted in both the spatial and channel dimensions by the spatial-channel parallel attention gate, then concatenated along the channel dimension with the decoder feature maps upsampled by the Onsampling algorithm. Subsequently, the Deformable Squeeze-and-Attention (DSA) Block further integrates these features and uses the combined features to learn segmentation details.
  • Figure 2: The computational process of Onsampling. Onsampling adds suitable offsets to the mapping coordinates in trilinear interpolation, enabling the learnability of sub-pixel positions. It obtains neighborhood weights through convolution, making the interpolation weights learnable. Ultimately, these weights are applied to the neighborhood pixels of the offset sub-pixels, realizing a dynamic interpolation algorithm.
  • Figure 3: Schematic diagram of the spatial-channel parallel attention gate. The gate computes weight maps separately in the spatial and channel dimensions, then merges the weight maps from these two dimensions into a single spatial-channel weight map. This weight map adjusts the significance of each position and channel in the encoder feature map, thereby enhancing the model's focus on important features and suppressing irrelevant information.
  • Figure 4: Visualization results of multi-organ segmentation on the Synapse dataset. We primarily compare Swin DER with other transformer-based segmentation models, such as UNETR, Swin UNETR, and nnFormer.
  • Figure 5: Brain tumor segmentation visualization. Compared to the current state-of-the-art image segmentation methods, Swin DER achieved the best results.
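
The Onsampling idea described in the Figure 2 caption (offsets added to the interpolation coordinates, plus learnable neighborhood weights) can be illustrated with a small 1D analogue. This is only a sketch under stated assumptions, not the authors' implementation: the function name `offset_linear_upsample_1d` is hypothetical, the real method operates on 3D volumes with trilinear interpolation, and the `offsets` and `weights` arguments are plain arrays standing in for values that the paper obtains through learning and convolution.

```python
import numpy as np

def offset_linear_upsample_1d(x, scale=2, offsets=None, weights=None):
    """Upsample a 1D signal by interpolating at (optionally shifted) sub-pixel positions.

    offsets: per-output shift added to the mapped source coordinate
             (assumed stand-in for learned offsets; zeros if None).
    weights: array of shape (out_len, 2) with neighborhood weights
             (assumed stand-in for predicted weights; linear weights if None).
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    out_len = n * scale
    dst = np.arange(out_len)
    # Half-pixel coordinate mapping used by ordinary linear interpolation.
    src = (dst + 0.5) / scale - 0.5
    if offsets is not None:
        # Shift each sampling position; this is what makes the position adjustable.
        src = src + np.asarray(offsets, dtype=float)
    # Keep sampling positions inside the input signal.
    src = np.clip(src, 0.0, n - 1)
    left = np.floor(src).astype(int)
    right = np.minimum(left + 1, n - 1)
    if weights is None:
        # Fall back to the fixed distance-based weights of linear interpolation.
        frac = src - left
        weights = np.stack([1.0 - frac, frac], axis=1)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the two neighbors of each (shifted) sub-pixel position.
    return weights[:, 0] * x[left] + weights[:, 1] * x[right]

signal = [0.0, 2.0, 4.0]
plain = offset_linear_upsample_1d(signal)
shifted = offset_linear_upsample_1d(signal, offsets=np.full(6, 0.25))
```

With zero offsets and the default weights the function reduces to ordinary linear interpolation, which makes the two extra degrees of freedom (where to sample, and how to weight the neighbors) easy to see in isolation.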
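
The reweighting described in the Figure 3 caption (separate spatial and channel weight maps merged into one spatial-channel weight map) can be sketched with NumPy broadcasting. This is a minimal illustration, not the SCP AG as published: the function name `parallel_attention_gate` is hypothetical, the pooling-plus-sigmoid branches replace whatever learnable layers the gate actually uses, and merging by element-wise multiplication is an assumption, since the caption does not state the merge operation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parallel_attention_gate(enc):
    """Reweight an encoder feature map of shape (C, D, H, W).

    Both branches are parameter-free stand-ins (assumption) for learned layers.
    """
    # Channel branch: one weight per channel from global average pooling.
    channel_w = sigmoid(enc.mean(axis=(1, 2, 3)))[:, None, None, None]  # (C, 1, 1, 1)
    # Spatial branch: one weight per voxel from the mean over channels.
    spatial_w = sigmoid(enc.mean(axis=0, keepdims=True))                # (1, D, H, W)
    # Merge the two maps by broadcasting (assumed merge rule).
    weight = channel_w * spatial_w                                      # (C, D, H, W)
    # Scale every position and channel of the encoder features.
    return enc * weight, weight

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 3, 4, 5))
gated, weight_map = parallel_attention_gate(features)
```

Because the two branches are computed independently and only combined at the end, each entry of the merged map reflects both how informative its channel is and how informative its position is.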