Self-Supervised Modality-Agnostic Pre-Training of Swin Transformers

Abhiroop Talasila, Maitreya Maity, U. Deva Priyakumar

TL;DR

This work addresses domain shift in medical image segmentation by proposing SwinFUSE, a modality-agnostic, self-supervised pre-training framework that learns from both CT and MRI. It extends the Swin UNETR architecture with a Domain Invariance Module (DIM) that aligns cross-modal features and emphasizes salient regions before encoding. The pre-training objective combines masked volume inpainting, 3D rotation prediction, contrastive learning, and a density-matching regularizer based on kernel density estimation (KDE), formalized as $L_{total}=L_{inpaint}+L_{contrast}+L_{rot}-L_{JSD}$ with a KDE estimator $p_{est}(X)=\frac{1}{N}\sum_{n=1}^{N} K\left(\frac{\lVert X - X_{n}\rVert_2}{\sigma}\right)$. On BraTS21 and MSD, SwinFUSE achieves competitive in-distribution performance while substantially outperforming single-modality baselines on out-of-distribution modalities, with improvements of up to $27\%$, which indicates practical potential for clinical deployment. Code is released at https://github.com/devalab/SwinFUSE.
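The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Gaussian-shaped kernel $K(u)=\exp(-u^2/2)$, the discrete Jensen-Shannon divergence helper, and the function names `kde_estimate`, `jsd`, and `total_loss` are all assumptions made for illustration. The sign of the JSD term follows the formula exactly as stated in the TL;DR.

```python
import numpy as np


def kde_estimate(x, samples, sigma=1.0):
    """p_est(x) = (1/N) * sum_n K(||x - x_n||_2 / sigma).

    The kernel K(u) = exp(-u^2 / 2) is an assumption; the summary does not
    say which kernel is used, and no normalising constant is applied.
    """
    x = np.asarray(x, dtype=float)
    samples = np.asarray(samples, dtype=float)
    dists = np.linalg.norm(samples - x, axis=1)  # ||x - x_n||_2 for every n
    return float(np.mean(np.exp(-0.5 * (dists / sigma) ** 2)))


def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (hypothetical helper)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def total_loss(l_inpaint, l_contrast, l_rot, l_jsd):
    """L_total = L_inpaint + L_contrast + L_rot - L_JSD, as written in the TL;DR."""
    return l_inpaint + l_contrast + l_rot - l_jsd
```

In this sketch the individual loss terms are plain scalars; in an actual training loop they would be differentiable tensors produced by the respective pre-training heads.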

Abstract

Unsupervised pre-training has emerged as a transformative paradigm, driving notable advances in various domains. However, its susceptibility to domain shift, where the pre-training data distribution differs from the fine-tuning distribution, poses a significant obstacle. To address this, we augment the Swin Transformer to learn from different medical imaging modalities, enhancing downstream performance. Our model, dubbed SwinFUSE (Swin Multi-Modal Fusion for UnSupervised Enhancement), offers three key advantages: (i) it learns from both Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) during pre-training, resulting in complementary feature representations; (ii) it incorporates a domain-invariance module (DIM) that highlights salient input regions, enhancing adaptability; and (iii) it generalizes beyond the tasks it was initially pre-trained on. Our experiments on two publicly available 3D segmentation datasets show a modest 1-2% performance trade-off compared to single-modality models, yet a significant gain of up to 27% on out-of-distribution modalities. This substantial improvement underscores the practical relevance and real-world applicability of the proposed approach. Code is available at: https://github.com/devalab/SwinFUSE

Paper Structure

This paper contains 10 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Visual interpretation of SwinFUSE's attention weights (darker shades indicate higher relevance) for a BraTS21 MRI and the model's segmentation output.
  • Figure 2: Outline of our proposed pre-training pipeline. Sub-volumes are randomly created from input images and augmented with random inner cutouts and rotations ($x_{i}, x_{j}$). Each augmentation passes through the patch partition layer to generate embeddings, which are fed to the DIM. The output from the DIM is extracted as kernel densities and forwarded to the Swin Transformer.
  • Figure 3: Qualitative visualizations of Swin UNETR and our proposed method. Colored regions correspond to necrotic tumor core (red), peritumoral edematous tissue (pink), and enhancing tumor (blue). Dice scores are also given.
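
The augmentation step described in the Figure 2 caption (random sub-volumes, random inner cutouts, and rotations producing two views $x_{i}, x_{j}$) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the function names, the sub-volume and cutout sizes, zero-filling of the cutout, and the restriction to 90-degree rotations about a single axis are all hypothetical choices.

```python
import numpy as np


def random_subvolume(volume, size, rng):
    """Crop a random sub-volume of the given size from a 3D array."""
    starts = [int(rng.integers(0, dim - s + 1)) for dim, s in zip(volume.shape, size)]
    region = tuple(slice(st, st + s) for st, s in zip(starts, size))
    return volume[region].copy()


def inner_cutout(sub, cut, rng):
    """Zero out one random inner block (zero fill is an assumption)."""
    out = sub.copy()
    starts = [int(rng.integers(0, dim - c + 1)) for dim, c in zip(sub.shape, cut)]
    region = tuple(slice(st, st + c) for st, c in zip(starts, cut))
    out[region] = 0.0
    return out


def random_rotation(sub, rng):
    """Rotate by k * 90 degrees in the first two axes; k can serve as a rotation label."""
    k = int(rng.integers(0, 4))
    return np.rot90(sub, k=k, axes=(0, 1)), k


def make_views(volume, size=(32, 32, 32), cut=(8, 8, 8), seed=0):
    """Return two augmented views (x_i, x_j) of one random sub-volume, each with its rotation label."""
    rng = np.random.default_rng(seed)
    sub = random_subvolume(volume, size, rng)
    return [random_rotation(inner_cutout(sub, cut, rng), rng) for _ in range(2)]
```

For example, `make_views(np.ones((40, 40, 40)))` yields two `(view, k)` pairs, each view of shape `(32, 32, 32)` with one `8 x 8 x 8` block masked out.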