Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, Manning Wang

TL;DR

The paper addresses the limitation of CNNs in capturing global dependencies for medical image segmentation. It introduces Swin-Unet, a U-Net-like pure Transformer architecture built from Swin Transformer blocks, with patch merging for down-sampling, a novel patch expanding layer for up-sampling, and skip connections between encoder and decoder. Experiments on the Synapse and ACDC datasets demonstrate strong segmentation accuracy and robust generalization, achieving state-of-the-art or competitive results with improved boundary delineation. The work shows that pure Transformer architectures are viable for 2D medical image segmentation and suggests directions for pretraining strategies and eventual extension to 3D data.

Abstract

In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. In particular, deep neural networks based on the U-shaped architecture and skip connections have been widely applied in a variety of medical image tasks. However, although CNNs have achieved excellent performance, they cannot learn global and long-range semantic information interaction well due to the locality of the convolution operation. In this paper, we propose Swin-Unet, which is a Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip connections for local-global semantic feature learning. Specifically, we use a hierarchical Swin Transformer with shifted windows as the encoder to extract context features. A symmetric Swin Transformer-based decoder with a patch expanding layer is designed to perform the up-sampling operation and restore the spatial resolution of the feature maps. Under direct down-sampling and up-sampling of the inputs and outputs by 4x, experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms methods based on full convolution or on a combination of Transformers and convolution. The code and trained models will be publicly available at https://github.com/HuCaoFighting/Swin-Unet.
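
As a rough illustration of the 4x down-sampling and up-sampling mentioned in the abstract, the Python sketch below traces feature-map shapes through a U-shaped encoder-decoder. It is shape bookkeeping only, not the authors' implementation. The function name trace_shapes, the 224x224 input, the embedding width of 96, and the three merging steps are assumptions chosen for illustration and are not stated in this summary. The sketch also assumes that each patch merging step halves the spatial resolution and doubles the channel count, that each patch expanding step does the reverse, and that the final 4x expansion keeps the channel count unchanged.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels) of a feature map


def trace_shapes(image_size: int = 224, patch_size: int = 4,
                 embed_dim: int = 96, num_merges: int = 3) -> List[Tuple[str, Shape]]:
    """Trace feature-map shapes through a U-shaped encoder-decoder.

    All default values are illustrative assumptions, not values from the text.
    """
    if image_size % (patch_size * 2 ** num_merges) != 0:
        raise ValueError("image_size must be divisible by patch_size * 2**num_merges")

    stages: List[Tuple[str, Shape]] = []

    # Patch partition + linear embedding: direct 4x down-sampling of the input.
    h = w = image_size // patch_size
    c = embed_dim
    stages.append(("patch embedding", (h, w, c)))

    # Encoder: remember each shape before merging, for the skip connections.
    skips: List[Shape] = []
    for i in range(num_merges):
        skips.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2  # patch merging (assumed 2x down, 2x channels)
        stages.append((f"patch merging {i + 1}", (h, w, c)))

    # Decoder: each patch expanding step must match the shape of its skip.
    for i, skip in enumerate(reversed(skips)):
        h, w, c = h * 2, w * 2, c // 2  # patch expanding (assumed 2x up, half channels)
        assert (h, w, c) == skip, "decoder shape must match skip connection"
        stages.append((f"patch expanding {i + 1}", (h, w, c)))

    # Final expansion undoes the initial 4x down-sampling (channels kept here).
    h, w = h * patch_size, w * patch_size
    stages.append(("final patch expanding", (h, w, c)))
    return stages


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:24s} {shape}")
```

The internal assertion makes the symmetry explicit: each decoder stage lands on exactly the shape saved by the corresponding encoder stage, which is what allows the skip connections to combine the two feature maps.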

Paper Structure

This paper contains 26 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The architecture of Swin-Unet, which is composed of an encoder, a bottleneck, a decoder, and skip connections. The encoder, bottleneck, and decoder are all constructed from Swin Transformer blocks.
  • Figure 2: Swin Transformer block.
  • Figure 3: The segmentation results of different methods on the Synapse multi-organ CT dataset.