Self-Supervised Modality-Agnostic Pre-Training of Swin Transformers
Abhiroop Talasila, Maitreya Maity, U. Deva Priyakumar
TL;DR
This work addresses domain shift in medical image segmentation by proposing SwinFUSE, a modality-agnostic, self-supervised pre-training framework that learns from CT and MRI. It extends the Swin UNETR architecture with a Domain Invariance Module to align cross-modal features and emphasize salient regions before encoding. The pre-training objective combines masked volume inpainting, 3D rotation, contrastive learning, and a KDE-based density-matching regularizer, formalized as $L_{total}=L_{inpaint}+L_{contrast}+L_{rot}-L_{JSD}$ with a KDE estimator $p_{est}(X)=\frac{1}{N}\sum_{n=1}^{N} K\left(\frac{\lVert X - X_{n}\rVert_2}{\sigma}\right)$. Across BraTS21 and MSD, SwinFUSE achieves competitive in-distribution performance while substantially outperforming single-modality baselines on out-of-distribution modalities, with improvements up to $27\%$, demonstrating practical potential for clinical deployment; code is released at https://github.com/devalab/SwinFUSE.
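As a rough illustration of the two formulas above, the following sketch evaluates the KDE estimator at a query point and combines the four loss terms with the stated signs. It is not the authors' implementation: the summary does not specify the kernel $K$, so an unnormalized Gaussian is assumed here, and the function names `kde_density` and `total_loss` are hypothetical.

```python
import numpy as np


def kde_density(x, samples, sigma=1.0):
    """Evaluate p_est(x) = (1/N) * sum_n K(||x - x_n||_2 / sigma).

    Assumption: K(u) = exp(-u^2 / 2), an unnormalized Gaussian kernel.
    The source does not state which kernel is used.
    """
    x = np.asarray(x, dtype=float)              # query point, shape (D,)
    samples = np.asarray(samples, dtype=float)  # reference points, shape (N, D)
    # Euclidean distance from the query to every reference sample.
    dists = np.linalg.norm(samples - x, axis=1)
    # Average the kernel responses over the N samples.
    return float(np.mean(np.exp(-0.5 * (dists / sigma) ** 2)))


def total_loss(l_inpaint, l_contrast, l_rot, l_jsd):
    """Combine the terms as L_total = L_inpaint + L_contrast + L_rot - L_JSD."""
    return l_inpaint + l_contrast + l_rot - l_jsd


if __name__ == "__main__":
    feats = [[0.0, 0.0], [3.0, 4.0]]  # toy feature vectors, illustrative only
    print(kde_density([0.0, 0.0], feats, sigma=5.0))
    print(total_loss(1.0, 0.5, 0.25, 0.125))
```

In an actual training loop the four terms would be differentiable tensors in a deep learning framework; plain floats are used here only to make the sign convention explicit.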
Abstract
Unsupervised pre-training has emerged as a transformative paradigm, driving advances across many domains. However, its susceptibility to domain shift, where the pre-training data distribution differs from that of fine-tuning, poses a significant obstacle. To address this, we augment the Swin Transformer to learn from different medical imaging modalities, enhancing downstream performance. Our model, dubbed SwinFUSE (Swin Multi-Modal Fusion for UnSupervised Enhancement), offers three key advantages: (i) it learns from both Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) data during pre-training, resulting in complementary feature representations; (ii) it includes a domain-invariance module (DIM) that highlights salient input regions, enhancing adaptability; (iii) it generalizes beyond the tasks it was initially pre-trained on. Our experiments on two publicly available 3D segmentation datasets show a modest 1-2% performance trade-off compared to single-modality models, yet gains of up to 27% on out-of-distribution modalities. This substantial improvement underscores the practical relevance and real-world applicability of our approach. Code is available at: https://github.com/devalab/SwinFUSE
