Swin SMT: Global Sequential Modeling in 3D Medical Image Segmentation

Szymon Płotka, Maciej Chrabaszcz, Przemyslaw Biecek

TL;DR

This paper addresses the challenge of segmenting 117 major anatomical structures in whole-body CT images by modeling both local and global long-range dependencies. It introduces Swin SMT, a 3D transformer-based architecture built on Swin UNETR that integrates Soft Mixture-of-Experts (Soft MoE) into Swin blocks (stages 2–4) to scale capacity while controlling computational cost, with a CNN-based decoder. Evaluated on TotalSegmentator-V2, Swin SMT achieves an average Dice Similarity Coefficient of 85.09% and demonstrates competitive inference speed (about 60 seconds) with 170.8M parameters; ablation shows performance improves with more experts, up to 32. The work highlights the potential of Soft MoE to capture diverse global representations in WBCT, offering a pathway toward accurate, clinically useful automated organ segmentation, and points to future gains from self-supervised pretraining and external-data validation.

Abstract

Recent advances in Vision Transformers (ViTs) have significantly enhanced medical image segmentation by facilitating the learning of global relationships. However, these methods face a notable challenge in capturing diverse local and global long-range sequential feature representations, which is particularly evident in whole-body CT (WBCT) scans. To overcome this limitation, we introduce Swin Soft Mixture Transformer (Swin SMT), a novel architecture based on Swin UNETR. This model incorporates a Soft Mixture-of-Experts (Soft MoE) to effectively handle complex and diverse long-range dependencies. The use of Soft MoE allows for scaling up model parameters while maintaining a balance between computational complexity and segmentation performance in both training and inference modes. We evaluate Swin SMT on the publicly available TotalSegmentator-V2 dataset, which includes 117 major anatomical structures in WBCT images. Comprehensive experimental results demonstrate that Swin SMT outperforms several state-of-the-art methods in 3D anatomical structure segmentation, achieving an average Dice Similarity Coefficient of 85.09%. The code and pre-trained weights of Swin SMT are publicly available at https://github.com/MI2DataLab/SwinSMT.
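
The headline result is an average Dice Similarity Coefficient (DSC) over the segmented structures. As a rough illustration of how such a number can be computed from integer label maps, here is a minimal NumPy sketch. The function names `dice_per_class` and `mean_dice`, the exclusion of background (label 0), and the choice to skip classes absent from both volumes are assumptions made for this example; the authors' exact evaluation protocol is not described in this section.

```python
import numpy as np

def dice_per_class(pred, target, num_classes, ignore_background=True):
    """Return {class_id: DSC} for every class present in pred or target."""
    scores = {}
    start = 1 if ignore_background else 0
    for c in range(start, num_classes):
        p = pred == c                      # predicted mask for class c
        t = target == c                    # reference mask for class c
        denom = p.sum() + t.sum()
        if denom == 0:
            continue                       # assumption: skip classes absent from both
        scores[c] = 2.0 * np.logical_and(p, t).sum() / denom
    return scores

def mean_dice(pred, target, num_classes):
    """Average the per-class DSC values (NaN if no class is present)."""
    scores = dice_per_class(pred, target, num_classes)
    return float(np.mean(list(scores.values()))) if scores else float("nan")

# Toy 1 x 2 x 2 volume with background (0) and two foreground classes.
target = np.array([[[0, 1], [1, 2]]])
pred = np.array([[[0, 1], [2, 2]]])
print(round(mean_dice(pred, target, num_classes=3), 4))  # 0.6667
```

In the toy volume each foreground class has one overlapping voxel out of three voxels in total across prediction and reference, so both classes score 2/3.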

Paper Structure (9 sections, 5 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: An overview of the Swin SMT architecture. The input to our model is a 3D CT scan. The Stem creates non-overlapping patches of the input data and utilizes a patch partition layer to generate windows with a desired size for computing self-attention. Encoded feature representations in each of the encoder blocks are then fed to a Convolutional Neural Network (CNN)-based decoder via skip connections at multiple resolutions. The segmentation output consists of 118 channels, corresponding to 117 classes and background, representing the major anatomical structures in WBCT images. H, W, D, and C refer to height, width, depth, and number of feature channels, respectively. W-MSA and SW-MSA refer to window-based multi-head self-attention with regular and shifted windows, respectively.
  • Figure 1: We show Swin SMT predictions on partial and entire-body CT scans. On the top, we show predictions on partial scans of various body parts, including the head-neck, chest, and pelvis, which can be less robust because less contextual information is available. On the bottom, we show entire-body predictions, for which the model is more robust due to the global and contextual sequential information provided by Swin SMT. We achieved significantly higher segmentation performance on entire-body scans than on partial ones. We show our predictions at real-world scale in centimeters [cm].
  • Figure 2: An overview of the Soft MoE. Here, the router computes logits for each pair of input token (patch) and slot. Normalizing these logits gives the dispatch weights, which assign each slot a weighted average of all the input tokens. Then, each expert processes its slots. Finally, the original logits are normalized per token and used to combine all the slot outputs for every input token. Tokens in slots are shown in decreasing order of the logits assigned to them by the dispatch weights.
  • Figure 2: We show top-performing predictions of Swin SMT for various subparts of the entire-body predictions. We provide the Dice Similarity Coefficient (in %) to show the segmentation performance. From the left: vertebrae (93.02%), ribs with sternum and costal cartilages (95.07%), muscles (92.34%), and vessels (86.00%). We show our predictions at real-world scale in centimeters [cm].
  • Figure 3: a) Distribution of the quantitative results of Swin SMT for each subgroup within WBCT images. Partial and full body part CT scans denote cropped scans (i.e., a sub-volume of a thoracic or abdominal scan) and entire body part scans (i.e., thoracic, abdominal, or whole-body), respectively. b) Average DSC plotted against inference time (in s). The inference time calculations are based on an input patch of 128 $\times$ 128 $\times$ 128 with a sliding window algorithm and an overlap of 0.5. The size of each circle indicates the number of parameters (in M).
  • ...and 1 more figures
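
The Soft MoE routing described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration of the general Soft MoE formulation (logits per token-slot pair, dispatch weights normalized over tokens, combine weights normalized over slots), not the authors' implementation: the function `soft_moe`, the parameter matrix `phi`, the toy `tanh` experts, and all sizes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(tokens, phi, expert_fns, slots_per_expert):
    """tokens: (n, d); phi: (d, num_experts * slots_per_expert)."""
    logits = tokens @ phi                   # (n, s): one logit per token-slot pair
    dispatch = softmax(logits, axis=0)      # per slot, weights over all tokens
    combine = softmax(logits, axis=1)       # per token, weights over all slots
    slots = dispatch.T @ tokens             # (s, d): weighted averages of tokens
    outputs = []
    for e, fn in enumerate(expert_fns):     # each expert processes only its own slots
        lo = e * slots_per_expert
        outputs.append(fn(slots[lo:lo + slots_per_expert]))
    slot_out = np.concatenate(outputs, axis=0)  # (s, d)
    return combine @ slot_out               # (n, d): mix slot outputs per token

# Hypothetical toy sizes: 8 tokens, 4 channels, 2 experts with 1 slot each.
rng = np.random.default_rng(0)
n, d, num_experts, slots_per_expert = 8, 4, 2, 1
tokens = rng.normal(size=(n, d))
phi = rng.normal(size=(d, num_experts * slots_per_expert))
weights = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda s, w=w: np.tanh(s @ w) for w in weights]  # toy experts

out = soft_moe(tokens, phi, experts, slots_per_expert)
print(out.shape)  # (8, 4)
```

Because every slot is a soft mixture of all tokens, no token is dropped and the whole layer stays differentiable. The cost of the expert computation depends on the number of slots rather than on how many parameters the experts hold in total, which is the property the abstract relies on when it describes scaling parameters while balancing computational complexity.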