MedMAE: A Self-Supervised Backbone for Medical Imaging Tasks

Anubhav Gupta, Islam Osman, Mohamed S. Shehata, John W. Braun

TL;DR

This work tackles data scarcity and domain shift in medical imaging by compiling a large unlabeled Medical Imaging Dataset (MID) and training a Vision Transformer–based MedMAE backbone via Masked Autoencoder pretraining. The approach yields a versatile, domain-specific representation learned through self-supervision that transfers effectively to diverse medical tasks, including quality control, cancer prediction, pneumonia detection, and segmentation. Across four tasks, MedMAE outperforms ImageNet-pretrained and standard MAE baselines, with average gains around 8%. These results demonstrate the value of domain-specific self-supervised pretraining for medical imaging and point toward continual learning approaches to support multi-task, single-model deployment.

Abstract

Medical imaging tasks are challenging because few labeled datasets are publicly available. Existing deep-learning models therefore struggle to reach high performance, as they require massive labeled datasets to train effectively. An alternative is to take pre-trained models and fine-tune them on the medical imaging dataset. However, existing models are pre-trained on natural images, a domain very different from medical imaging, and the resulting domain shift leads to poor performance. To overcome these problems, we propose a large-scale unlabeled dataset of medical images and a backbone pre-trained on this dataset with a self-supervised learning technique called the Masked Autoencoder (MAE). This backbone can be used as a pre-trained model for any medical imaging task, as it is trained to learn a visual representation of different types of medical images. To evaluate the proposed backbone, we used four different medical imaging tasks and compared the results with existing pre-trained models. These experiments show the superiority of our proposed backbone in medical imaging tasks.

Paper Structure

This paper contains 15 sections, 1 equation, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: MedMAE architecture. 75% of the original image is randomly masked, and the remaining 25% of visible patches are fed to the encoder, which encodes them into latent representations. The decoder then reconstructs the complete image from the encoded visible patches and the masked patches. The reconstruction loss drives the reconstruction to improve with each training iteration.
  • Figure 2: Image reconstruction using an MAE pre-trained on natural images.
  • Figure 3: Image reconstruction using our proposed MedMAE.
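
The masking step described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the 196-patch count (a 224x224 image split into 16x16 patches), and the choice to compute the mean squared error over masked patches only (as in the original MAE design) are assumptions made for illustration.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into a visible set and a masked set.

    With mask_ratio=0.75, 25% of the patches stay visible (encoder input)
    and 75% are hidden (to be reconstructed by the decoder).
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


def masked_mse(target, prediction, masked):
    """Mean squared error over the pixels of the masked patches only.

    target and prediction are lists of patches; each patch is a list of
    pixel values. Restricting the loss to masked patches is an assumption
    borrowed from the original MAE formulation.
    """
    total = 0.0
    count = 0
    for i in masked:
        for t, p in zip(target[i], prediction[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical example: 196 patches, of which 49 remain visible.
visible, masked = random_masking(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

In this sketch only the visible indices would be passed to the encoder, which is what makes pretraining at a high mask ratio comparatively cheap: the encoder processes a quarter of the patches.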