Self-supervised learning improves robustness of deep learning lung tumor segmentation to CT imaging differences
Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan
TL;DR
This study investigates how wild-pretraining versus self-pretraining affects robustness in CT-based lung tumor segmentation across ViT, Swin, and CNN architectures. It finds that wild-pretraining, particularly with the Swin Transformer, yields superior robustness to imaging differences (contrast, kernel, slice thickness) and improves segmentation accuracy, while the SMIT pretext task combination (MIP+ITD+MPD) most consistently enhances performance. The work connects improved accuracy to distinct feature reuse patterns revealed by centered kernel alignment (CKA), showing that wild-pretraining increases lower-layer reuse and creates greater differentiation near the output layers after fine-tuning. Overall, wild-pretrained Swin models offer the strongest generalization to heterogeneous CT data, with promising implications for cross-institutional clinical deployment of lung tumor segmentation tools.
Abstract
Self-supervised learning (SSL) extracts useful feature representations from unlabeled data and enables fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated task dataset for both pretraining and fine-tuning the networks. The availability of large, diverse, and uncurated public medical image sets provides the opportunity to apply SSL in the "wild" and potentially extract features robust to imaging variations. However, the benefit of wild- versus self-pretraining has not been studied for medical image analysis. In this paper, we compare the robustness of wild- versus self-pretrained transformer models (vision transformer [ViT] and hierarchical shifted window [Swin]) to computed tomography (CT) imaging differences for non-small cell lung cancer (NSCLC) segmentation. Wild-pretrained Swin models outperformed self-pretrained Swin models across the various imaging acquisitions. ViT achieved similar accuracy with wild- and self-pretraining. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task, which models global image information. Wild-pretrained models showed higher feature reuse in the lower-level layers and greater feature differentiation close to the output layer after fine-tuning. Hence, we conclude that wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained networks, and that the Swin architecture benefited from such pretraining more than ViT.
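
Feature reuse between a pretrained and a fine-tuned network is often quantified with centered kernel alignment (CKA), the measure named in the TL;DR. The sketch below shows the linear variant of CKA on two activation matrices collected for the same inputs. It is an illustration under stated assumptions, not the authors' analysis code: the function name `linear_cka`, the choice of the linear kernel, and the random activations in the usage lines are all hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices (hypothetical helper).

    x: array of shape (n_samples, d1), e.g. a layer of the pretrained model.
    y: array of shape (n_samples, d2), e.g. the same layer after fine-tuning.
    Rows of x and y must correspond to the same input samples.
    """
    # Center each feature (column) so that the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Usage with synthetic stand-ins for layer activations (assumed data).
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(50, 8))
finetuned = pretrained + 0.1 * rng.normal(size=(50, 8))
similarity = linear_cka(pretrained, finetuned)
```

A score near 1 for a given layer would indicate that fine-tuning largely kept the pretrained features (high reuse), while lower scores would indicate that the layer's representation changed.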
