
Self-supervised learning improves robustness of deep learning lung tumor segmentation to CT imaging differences

Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan

TL;DR

This study investigates how wild-pretraining versus self-pretraining affects robustness in CT-based lung tumor segmentation across ViT, Swin, and CNN architectures. It finds that wild-pretraining, particularly with the Swin Transformer, yields superior robustness to imaging differences (contrast, kernel, slice thickness) and improves segmentation accuracy, while the SMIT pretext task combination (MIP+ITD+MPD) most consistently enhances performance. The work connects improved accuracy to distinct feature reuse patterns revealed by centered kernel alignment (CKA), showing that wild-pretraining increases lower-layer feature reuse and creates greater differentiation near the output layers after fine-tuning. Overall, wild-pretrained Swin models offer the strongest generalization to heterogeneous CT data, with promising implications for cross-institutional clinical deployment of lung tumor segmentation tools.

Abstract

Self-supervised learning (SSL) extracts useful feature representations from unlabeled data and enables fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated task dataset for both pretraining and fine-tuning the networks. The availability of large, diverse, and uncurated public medical image sets provides the opportunity to apply SSL in the "wild" and potentially extract features robust to imaging variations. However, the benefit of wild- versus self-pretraining has not been studied for medical image analysis. In this paper, we compare the robustness of wild- versus self-pretrained transformer models (vision transformer [ViT] and hierarchical shifted window [Swin]) to computed tomography (CT) imaging differences for non-small cell lung cancer (NSCLC) segmentation. Wild-pretrained Swin models outperformed self-pretrained Swin models across the analyzed imaging acquisitions. ViT achieved similar accuracy with wild- and self-pretraining. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task, which models global image information. Wild-pretrained models showed higher feature reuse in the lower-level layers and greater feature differentiation close to the output layer after fine-tuning. Hence, we conclude that wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained networks, and that the Swin architecture benefited from such pretraining more than ViT.
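
The feature reuse analysis relies on CKA to compare layer activations before and after fine-tuning. The sketch below is one way such a comparison could be computed, assuming the linear variant of CKA; this page does not say which kernel the authors used. The function name `linear_cka` and the random activation matrices are hypothetical stand-ins, not the authors' code or data. A value near 1 suggests that a layer's features were largely reused after fine-tuning, and a lower value suggests that they changed.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: (n_samples, d1) activations from one layer or model.
    y: (n_samples, d2) activations from another layer or model.
    Assumption: linear kernel; other kernels are possible.
    """
    # Center every feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical activations: 64 samples from one layer, before and after fine-tuning.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(64, 32))
finetuned_similar = pretrained + 0.05 * rng.normal(size=(64, 32))  # features mostly reused
finetuned_different = rng.normal(size=(64, 16))                    # unrelated features

cka_similar = linear_cka(pretrained, finetuned_similar)
cka_different = linear_cka(pretrained, finetuned_different)
print(f"similar: {cka_similar:.3f}, different: {cka_different:.3f}")
```

In practice the two inputs would be activations of the same layer, flattened to one row per sample, taken from the pretrained and the fine-tuned network on the same set of inputs.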

Paper Structure (25 sections, 3 equations, 7 figures, 7 tables)


Figures (7)

  • Figure 1: Transformer architectures combined with convolutional decoders for 3D tumor segmentation.
  • Figure 2: The scatter plot of DSC versus tumor volume (cc) to assess dependency of accuracy on the tumor volume for the analyzed architectures subjected to scratch, wild-pretraining and self-pretraining followed by fine-tuning for the two testing datasets.
  • Figure 3: Influence of CT reconstruction kernel on segmentation accuracy with (A) CNN backbone, (B) ViT backbone, and (C) Swin backbone.
  • Figure 4: Segmentation (yellow contour) produced using CNN, ViT and Swin-based SMIT model using wild-pretraining and self-pretraining. (A) Contrast CT scan using GE lung reconstruction. (B) Contrast CT scan using GE standard reconstruction. (C) Non-contrast CT scan using GE lung reconstruction. Manual delineation is shown in red and algorithm delineation is shown in yellow.
  • Figure 5: Results on the phantom image scan (dataset). Yellow contour indicates the model segmentation and red contour indicates the phantom ground truth.
  • ...and 2 more figures
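
Figure 2 plots DSC against tumor volume in cc. As a minimal sketch of how those two quantities could be computed from binary masks, assuming DSC denotes the Dice similarity coefficient and that voxel spacing is given in millimeters: the function names, the toy masks, and the 1 mm spacing below are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0
    intersection = np.logical_and(pred, truth).sum()
    return float(2.0 * intersection / denom)


def volume_cc(mask, spacing_mm):
    """Mask volume in cubic centimeters, given voxel spacing in mm."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return float(np.asarray(mask, dtype=bool).sum() * voxel_mm3 / 1000.0)


# Toy example: a 10x10x10 voxel "tumor" and a prediction shifted by 2 voxels.
truth = np.zeros((32, 32, 32), dtype=bool)
truth[10:20, 10:20, 10:20] = True
pred = np.zeros_like(truth)
pred[12:22, 10:20, 10:20] = True

dsc = dice_coefficient(pred, truth)            # 2 * 800 / (1000 + 1000) = 0.8
vol = volume_cc(truth, (1.0, 1.0, 1.0))        # 1000 mm^3 = 1.0 cc
print(f"DSC = {dsc:.2f}, volume = {vol:.1f} cc")
```

Computing one (volume, DSC) pair per test case in this way yields the points of a scatter plot like the one Figure 2 describes.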