CSWin-UNet: Transformer UNet with Cross-Shaped Windows for Medical Image Segmentation

Xiao Liu, Peng Gao, Tao Yu, Fei Wang, Ru-Yue Yuan

TL;DR

CSWin-UNet addresses the need for efficient, long-range contextual modeling in medical image segmentation by embedding CSWin self-attention into a U-shaped Transformer architecture and using CARAFE for edge-preserving upsampling. The approach achieves state-of-the-art or competitive segmentation accuracy across CT, MRI, and skin lesion datasets while maintaining low model complexity. Key contributions include the CSWin Transformer block with horizontal and vertical stripe attention, a four-stage encoder–decoder design, and CARAFE-based decoding with skip connections, trained with a combined Dice and cross-entropy loss. The results demonstrate robust cross-modal performance, improved boundary delineation, and favorable computational efficiency, which makes the model attractive for deployment on resource-constrained platforms. Limitations include variable performance in challenging regions and a dependence on pretraining, suggesting future work on end-to-end training and further architectural refinements.
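To make the combined Dice and cross-entropy objective mentioned above concrete, the following is a minimal sketch in plain Python over a flat list of pixels. The function names and the `dice_weight` and `ce_weight` parameters are illustrative assumptions; the weighting actually used in the paper is not stated in this summary.

```python
import math

def cross_entropy(probs, labels, eps=1e-7):
    """Mean negative log-likelihood over pixels.

    probs[i][c] is the predicted probability of class c at pixel i;
    labels[i] is the ground-truth class index of pixel i.
    """
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)

def dice_loss(probs, labels, num_classes, eps=1e-7):
    """One minus the soft Dice coefficient, averaged over classes."""
    total = 0.0
    for c in range(num_classes):
        pred = [p[c] for p in probs]
        target = [1.0 if y == c else 0.0 for y in labels]
        intersection = sum(a * b for a, b in zip(pred, target))
        denom = sum(pred) + sum(target)
        total += 1.0 - (2.0 * intersection + eps) / (denom + eps)
    return total / num_classes

def combined_loss(probs, labels, num_classes, dice_weight=0.5, ce_weight=0.5):
    # The 0.5 / 0.5 weights are placeholders, not values from the paper.
    return (dice_weight * dice_loss(probs, labels, num_classes)
            + ce_weight * cross_entropy(probs, labels))
```

The Dice term rewards region overlap, which helps with small or imbalanced structures, while the cross-entropy term supplies a per-pixel classification signal; a weighted sum is a common way to combine the two.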

Abstract

Deep learning, especially convolutional neural networks (CNNs) and Transformer architectures, has become the focus of extensive research in medical image segmentation, achieving impressive results. However, CNNs come with inductive biases that limit their effectiveness in more complex, varied segmentation scenarios. Conversely, while Transformer-based methods excel at capturing global and long-range semantic details, they suffer from high computational demands. In this study, we propose CSWin-UNet, a novel U-shaped segmentation method that incorporates the CSWin self-attention mechanism into the UNet to perform self-attention within horizontal and vertical stripes. This method significantly enhances both computational efficiency and receptive field interactions. Additionally, our decoder uses a content-aware reassembly operator that reassembles features, guided by predicted kernels, for precise image resolution restoration. Our extensive empirical evaluations on diverse datasets, including Synapse multi-organ CT, cardiac MRI, and skin lesions, demonstrate that CSWin-UNet maintains low model complexity while delivering high segmentation accuracy. Code is available at https://github.com/eatbeanss/CSWin-UNet.
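The content-aware reassembly step described in the abstract can be illustrated with a small sketch: each output pixel is a weighted sum of a k x k neighborhood around its source pixel, with weights given by a softmax-normalized predicted kernel. This is a simplified single-channel illustration in plain Python, not the authors' implementation. The kernel predictor is omitted, so `kernel_logits` is passed in directly, and the names `reassemble_upsample`, `scale`, and `k` are assumptions made for this example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of raw scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def reassemble_upsample(feature, kernel_logits, scale=2, k=3):
    """Upsample a single-channel H x W map by `scale`.

    kernel_logits[i][j] holds k*k raw scores for output pixel (i, j).
    In a full model these would come from a learned kernel predictor;
    here they are supplied by the caller (an assumption of this sketch).
    """
    h, w = len(feature), len(feature[0])
    r = k // 2
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for i in range(h * scale):
        for j in range(w * scale):
            si, sj = i // scale, j // scale  # source location of this output pixel
            weights = softmax(kernel_logits[i][j])
            acc = 0.0
            idx = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = si + di, sj + dj
                    if 0 <= y < h and 0 <= x < w:  # positions outside the map count as zero
                        acc += weights[idx] * feature[y][x]
                    idx += 1
            out[i][j] = acc
    return out
```

When every kernel puts all of its weight on the center position, the operator reduces to nearest-neighbor upsampling; content-dependent kernels let it instead sharpen or smooth locally, which is the motivation for using it to restore boundaries.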

Paper Structure

This paper contains 26 sections, 10 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Illustration of different self-attention mechanisms. $h_i$ denotes the $i$-th attention head.
  • Figure 2: Overview of the proposed CSWin-UNet. The decoder and encoder are symmetrical and each consists of four stages.
  • Figure 3: Pipeline of the CSWin Transformer Block.
  • Figure 4: Illustration of the CSWin self-attention mechanism. First, split the multiple heads $\{h_1,h_2,\ldots,h_N\}$ into two groups $\{h_1,h_2,\ldots,h_{N/2}\}$ and $\{h_{N/2+1},h_{N/2+2},\ldots,h_N\}$, performing self-attention in parallel on the horizontal and vertical stripes, respectively, and concatenate the outputs. Next, the width of the stripe $sw$ can be adjusted to achieve optimal performance. Generally, choose a smaller $sw$ for higher resolutions and a larger $sw$ for lower resolutions.
  • Figure 5: Error bars (95% confidence interval) of mean DSC, mean HD, and DSC for each organ on the Synapse dataset.
  • ...and 4 more figures
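The head split and stripe partition described in the Figure 4 caption can be sketched as follows. This plain-Python example only computes which tokens a given position attends to; it does not compute attention weights. The function names are hypothetical, and heads are 0-indexed here whereas the caption numbers them from 1.

```python
def split_heads(num_heads):
    """First half of the heads use horizontal stripes, second half vertical."""
    half = num_heads // 2
    return list(range(half)), list(range(half, num_heads))

def stripe_neighbors(row, col, height, width, sw, orientation):
    """Tokens attended to by the token at (row, col) within its stripe of width sw."""
    if orientation == "horizontal":
        start = (row // sw) * sw
        rows = range(start, min(start + sw, height))
        return {(r, c) for r in rows for c in range(width)}
    if orientation == "vertical":
        start = (col // sw) * sw
        cols = range(start, min(start + sw, width))
        return {(r, c) for r in range(height) for c in cols}
    raise ValueError("orientation must be 'horizontal' or 'vertical'")

def cross_shaped_field(row, col, height, width, sw):
    """Union of both stripes: the cross-shaped region seen after concatenating head groups."""
    return (stripe_neighbors(row, col, height, width, sw, "horizontal")
            | stripe_neighbors(row, col, height, width, sw, "vertical"))
```

For an 8 x 8 token map with `sw = 2`, each stripe covers 16 tokens and the two stripes overlap in a 2 x 2 patch, so a token sees 28 of the 64 positions in a single block. Increasing `sw` widens the cross toward full global attention at higher cost, which matches the caption's advice to use a smaller `sw` at higher resolutions.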