Self-distilled Masked Attention guided masked image modeling with noise Regularized Teacher (SMART) for medical image analysis

Jue Jiang, Aneesh Rangnekar, Chloe Min Seo Choi, Harini Veeraraghavan

TL;DR

SMART presents a Swin-based self-supervised pretraining framework that enables global attention-guided masking for medical imaging by incorporating a semantic attention module and a noisy teacher in a co-distillation setup. The method achieves superior downstream performance on 3D CT tasks, including lung nodule classification and immunotherapy response prediction, while providing interpretable attention maps and zero-shot localization capabilities. Through extensive ablations, SMART demonstrates that the combination of semantic attention, AMIP losses, and noisy teacher regularization yields robust representations even with limited labeled data. This work advances interpretable, data-efficient pretraining for medical vision transformers and broadens the applicability of attention-guided masking to Swin architectures.

Abstract

Pretraining vision transformers (ViT) with attention-guided masked image modeling (MIM) has been shown to increase downstream accuracy for natural image analysis. The hierarchical shifted window (Swin) transformer, often used in medical image analysis, cannot use attention-guided masking because it lacks an explicit [CLS] token, which is needed to compute the attention maps used for selective masking. We thus enhanced Swin with semantic class attention. We developed a co-distilled Swin transformer that uses a noisy, momentum-updated teacher to guide selective masking for MIM. Our approach, called SeMantic Attention guided co-distillation with noisy teacher Regularized Swin TransFormer (SMARTFormer), was applied to analyze 3D computed tomography datasets with lung nodules and malignant lung cancers (LC). We also analyzed the impact of semantic attention and the noisy teacher on pretraining and downstream accuracy. SMARTFormer distinguished malignant from benign lesions with an accuracy of 0.895 on 1000 nodules, predicted LC treatment response with an accuracy of 0.74, and achieved high accuracies even in limited-data regimes. Pretraining with semantic attention and a noisy teacher improved the ability to distinguish semantically meaningful structures such as organs in an unsupervised clustering task and to localize abnormal structures such as tumors. Code and models will be made available through GitHub upon paper acceptance.
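
The momentum-updated teacher mentioned above can be illustrated with a minimal sketch of an exponential moving average (EMA) weight update. This is not the authors' implementation: the function name `ema_update`, the dictionary-of-arrays weight layout, and the momentum value are assumptions chosen for illustration, and the noise regularization of the teacher is left out because this section does not say how the noise is applied.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.996):
    """Return new teacher weights as an EMA of the student weights.

    teacher, student: dicts mapping parameter names to NumPy arrays
    (a hypothetical layout used only for this sketch).
    momentum: fraction of the old teacher kept at each step (assumed value).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy example: a single 3-element parameter.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher["w"])
```

With a momentum close to 1, the teacher changes slowly and acts as a smoothed copy of the student, which is the usual motivation for using it as the guiding network in distillation setups.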

Paper Structure

This paper contains 29 sections, 6 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: The key components of SMART (a) and the performance on LIDC data with Semantic Attention (SA) and Noisy Teacher (NT).
  • Figure 2: SMART: (a) shows the attention-guided masking for the student and (b) shows the noisy teacher for the exponential moving average (EMA) process. (c) shows the detailed attention-guided masking for the student and (d) depicts the different outputs produced by the modified Swin transformer. We use [CLS] to represent the class token embedding obtained by integrating the Semantic Attention (SA) layer into stage #3 of the Swin transformer. We use an additional prediction head (Decoder) for the masked image prediction.
  • Figure 3: UMAP-based clustering of features extracted by various analyzed pretrained models applied to the OrganMNIST data without additional fine tuning.
  • Figure 4: Attention maps computed for example images by AttMask (b,c) and SMART (d,e) after pretraining and fine tuning on the Immunotherapy dataset for response prediction. Tumors are indicated within red bounding boxes.
  • Figure 5: Attention map for AttMask (b) and SMART (c) for zero-shot segmentation of lung tumor (a) on an example from the zero-shot 5R dataset. Tumor is indicated within red bounding box.
  • ...and 6 more figures
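
The attention-guided masking step described in the Figure 2 caption can be sketched as follows. This is a hedged illustration only: it assumes that a per-patch attention score from the [CLS] token is already available and that the most-attended patches are the ones hidden from the student. The function name `attention_guided_mask`, the `mask_ratio` parameter, and the toy scores are hypothetical and do not come from the paper.

```python
import numpy as np


def attention_guided_mask(cls_attention, mask_ratio=0.5):
    """Return a boolean mask over patches; True marks a patch to hide.

    cls_attention: 1D array of [CLS]-to-patch attention scores
    (assumed input; how SMART computes them is not shown here).
    mask_ratio: fraction of patches to mask (assumed value).
    """
    scores = np.asarray(cls_attention, dtype=float)
    num_masked = int(round(mask_ratio * scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    if num_masked > 0:
        # Indices of the highest-scoring patches, in descending score order.
        top = np.argsort(scores)[::-1][:num_masked]
        mask[top] = True
    return mask


# Toy example: five patches, mask the two most-attended ones.
attn = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
mask = attention_guided_mask(attn, mask_ratio=0.4)
print(mask)
```

Masking the most-attended patches forces the student to reconstruct the regions the teacher considers most informative; a variant that masks the least-attended patches would only need the sort order reversed.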