AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities

Badhan Kumar Das, Gengyan Zhao, Han Liu, Thomas J. Re, Dorin Comaniciu, Eli Gibson, Andreas Maier

TL;DR

AdaViT tackles the problem of heterogeneous input modalities in medical imaging by introducing a dynamic, modality-aware Vision Transformer that can process a variable set of MRI contrasts per case. The method combines a 3D Dynamic Convolution Tokenizer with a transformer encoder to handle variable-length sequences of modality tokens, and pairs the encoder with a UNETR decoder for supervised segmentation or a masked autoencoder for self-supervised pretraining. Empirical results show that AdaViT outperforms fixed-modality baselines in zero-shot, few-shot, and backward transfer scenarios for brain infarct and BraTS segmentation, and that self-supervised pretraining further improves performance by making use of data across all available modalities. The approach enables flexible pretraining and finetuning, improving data utilization, transferability, and robustness across heterogeneous clinical data, with implications for continual and federated learning in practice.

Abstract

Pretraining techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects/cases, which poses a challenge for deep learning models that assume consistent input modalities across all cases and between pretraining and finetuning. Existing methods struggle to maintain performance when the input modality/contrast set does not match that of the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling a variable set of input modalities for each case. We utilize a dynamic tokenizer to encode the different input image modalities into tokens and take advantage of the transformer's ability to apply attention across a variable number of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, resulting in superior performance on zero-shot testing, few-shot finetuning, and backward transfer in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretraining, the proposed method can make full use of the available pretraining data and facilitates transfer to diverse downstream tasks with variable sets of input modalities.
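
The abstract's central idea, encoding whichever modalities a case provides into one token sequence and letting attention operate over however many tokens result, can be illustrated with a small NumPy sketch. This is not the authors' 3D Dynamic Convolution Tokenizer, whose details are not given here. As a stand-in, it assumes a shared linear patch projection plus a per-modality embedding vector. The names `PATCH`, `DIM`, `W_patch`, `modality_embed`, `patchify`, `tokenize`, and `self_attention` are all hypothetical, and the attention is a single head without learned projections.

```python
import numpy as np

PATCH = 4  # hypothetical cubic patch edge length (voxels)
DIM = 8    # hypothetical token embedding size
MODALITIES = ["ADC", "TraceW", "FLAIR", "T1", "T2"]  # contrast names from Figure 3

rng = np.random.default_rng(0)
# Assumed stand-in for the tokenizer: one shared patch projection
# plus one embedding vector per modality.
W_patch = rng.normal(size=(PATCH**3, DIM)) * 0.02
modality_embed = {m: rng.normal(size=DIM) * 0.02 for m in MODALITIES}


def patchify(volume):
    """Split a (D, H, W) volume into flattened, non-overlapping cubic patches."""
    d, h, w = (s // PATCH for s in volume.shape)
    v = volume[: d * PATCH, : h * PATCH, : w * PATCH]
    v = v.reshape(d, PATCH, h, PATCH, w, PATCH).transpose(0, 2, 4, 1, 3, 5)
    return v.reshape(d * h * w, PATCH**3)


def tokenize(case):
    """Map {modality name: volume} to one token sequence of shape (N, DIM)."""
    tokens = [
        patchify(vol) @ W_patch + modality_embed[name]
        for name, vol in case.items()
    ]
    return np.concatenate(tokens, axis=0)


def self_attention(x):
    """Single-head scaled dot-product attention; works for any sequence length."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def random_volume():
    return rng.normal(size=(8, 8, 8))


# Two cases with different modality sets pass through the same weights.
case_a = {"ADC": random_volume(), "TraceW": random_volume()}
case_b = {"FLAIR": random_volume(), "T1": random_volume(), "T2": random_volume()}
out_a = self_attention(tokenize(case_a))
out_b = self_attention(tokenize(case_b))
print(out_a.shape, out_b.shape)  # (16, 8) (24, 8)
```

Each 8x8x8 volume yields 8 patches, so the two-modality case produces 16 tokens and the three-modality case 24. No parameter in the sketch depends on the number of modalities, which is the property that lets one set of weights be reused across datasets with different contrast sets.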

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Performance comparison of different models on brain infarct and brain tumor segmentation in zero-shot testing, few-shot finetuning, and backward transferring.
  • Figure 2: (a) Overview of AdaViT for supervised pretraining and finetuning on a segmentation task. (b) AdaViT in a masked autoencoder setting for self-supervised pretraining. Both architectures can handle a variable set of input modalities for each case.
  • Figure 3: Reconstructions from self-supervised pretraining. First row: original images. Second row: masked images, with masked patches shown in black. Third row: reconstructed images. (Axial slices of ADC, TraceW, FLAIR, GRE, T1, T2, T1CE, and SWI are shown from left to right.)
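
The masking step behind Figure 3 can be sketched as random selection of patch tokens to hide, applied to a sequence whose length depends on how many modalities a case contains. This is a minimal illustration of generic masked-autoencoder masking, not the authors' procedure. The function `random_mask`, the value of `MASK_RATIO`, and the figure of 8 tokens per modality are hypothetical and do not come from the paper.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted index arrays (visible, masked) for one token sequence."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    masked = np.sort(perm[:num_masked])
    visible = np.sort(perm[num_masked:])
    return visible, masked


rng = np.random.default_rng(0)
MASK_RATIO = 0.75        # hypothetical value, not taken from the paper
TOKENS_PER_MODALITY = 8  # hypothetical patch count per volume

# The same masking routine applies regardless of how many modalities a case has.
for n_modalities in (2, 5, 8):
    n_tokens = n_modalities * TOKENS_PER_MODALITY
    visible, masked = random_mask(n_tokens, MASK_RATIO, rng)
    print(n_modalities, len(visible), len(masked))
```

In a masked autoencoder, only the visible tokens are passed to the encoder, and the decoder is trained to reconstruct the masked patches. Because the routine takes the sequence length as an argument, cases with any subset of modalities can contribute to pretraining.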