MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging

Mahmoud Soliman, Islam Osman, Mohamed S. Shehata, Rasika Rajapakshe

TL;DR

MedDChest addresses the domain gap between natural-image pre-training and thoracic medical imaging by pre-training a Vision Transformer from scratch on a large, in-domain, multimodal thoracic dataset. It introduces Guided Random Resized Crop, a content-guided multi-crop augmentation that focuses learning on anatomically relevant regions, and shows that in-domain pre-training yields stronger representations for thoracic tasks. In linear probing, the approach reaches state-of-the-art results on pneumonia detection (AUROC 99.8%) and ChestX-ray14 classification (accuracy 94.5%), outperforming ImageNet-pretrained baselines and MedMAE. By releasing the model weights, it provides a robust foundation for future thoracic AI research and clinical translation.

Abstract

The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, compiled from 10 public sources and spanning modalities including chest X-ray and computed tomography (CT). A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model's effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.
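The abstract does not spell out how Guided Random Resized Crops biases sampling, so the following is only a minimal sketch of one way a content-aware crop could work. Everything specific here is an assumption for illustration, not the authors' method: "content" is taken to be pixels brighter than a `threshold`, guidance is done by rejection sampling on the crop's content fraction, and the names `foreground_fraction`, `guided_random_crop`, `min_content`, and `max_tries` are hypothetical. For simplicity the crop keeps the image's aspect ratio.

```python
import random


def foreground_fraction(mask, top, left, h, w):
    """Fraction of pixels inside the crop box that are marked as content."""
    total = sum(mask[r][c]
                for r in range(top, top + h)
                for c in range(left, left + w))
    return total / (h * w)


def guided_random_crop(image, scale=(0.3, 1.0), min_content=0.5,
                       threshold=0.1, max_tries=20, rng=None):
    """Sample a crop box (top, left, h, w) biased toward content-rich regions.

    Assumptions (not from the paper): content is any pixel brighter than
    `threshold`, and guidance is rejection sampling on the content fraction.
    """
    rng = rng or random.Random()
    H, W = len(image), len(image[0])
    # Binary content mask: 1 where the pixel is considered informative.
    mask = [[1 if px > threshold else 0 for px in row] for row in image]

    # Fallback is the full image, replaced by the best candidate seen so far.
    best_box = (0, 0, H, W)
    best_frac = foreground_fraction(mask, 0, 0, H, W)

    for _ in range(max_tries):
        # Draw the crop area as a fraction of the image area, as in a
        # standard random resized crop (aspect ratio kept fixed here).
        side = rng.uniform(*scale) ** 0.5
        h = max(1, min(H, round(H * side)))
        w = max(1, min(W, round(W * side)))
        top = rng.randint(0, H - h)
        left = rng.randint(0, W - w)

        frac = foreground_fraction(mask, top, left, h, w)
        if frac >= min_content:
            return (top, left, h, w)  # accept a sufficiently content-rich crop
        if frac > best_frac:
            best_box, best_frac = (top, left, h, w), frac

    return best_box  # no candidate passed; return the best one seen
```

A uniform random crop on a scan with large dark borders often lands on background. The rejection step above is one simple way to skew the crop distribution toward the anatomy while keeping the randomness that augmentation relies on.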

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Multi-crop augmentation process for medical image analysis. The input medical image is processed to generate multiple views: 2 global crops at higher resolution and 8 local crops at lower resolution for comprehensive feature extraction.
  • Figure 2: DINOv2 Self-Supervised Learning Architecture in MedD. The asymmetric data augmentation strategy feeds global crops to both student and teacher networks, while local crops are only fed to the student network. The teacher network is updated via exponential moving average (EMA) and receives no gradient updates.
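The two captions describe a view-routing rule (global crops go to both networks, local crops only to the student) and an EMA teacher update. The sketch below illustrates both on plain Python lists. It is a hedged illustration rather than the paper's implementation: the function names `route_views` and `ema_update` are hypothetical, the momentum value 0.996 is an assumed placeholder, and skipping the pair where student and teacher see the same global view follows the usual DINO-style convention, which the captions do not state explicitly.

```python
def ema_update(teacher, student, momentum=0.996):
    """Return teacher weights after one EMA step.

    teacher <- momentum * teacher + (1 - momentum) * student, element-wise.
    The teacher is never updated by gradients; `momentum` is an assumed value.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


def route_views(n_global=2, n_local=8):
    """Assign crop indices to each network and list the loss pairs.

    Global crops are indexed first, local crops after them.
    Returns (teacher_views, student_views, pairs), where each pair is
    (teacher_view, student_view) with the two views being different crops.
    """
    global_ids = list(range(n_global))
    local_ids = list(range(n_global, n_global + n_local))

    teacher_views = global_ids               # teacher sees global crops only
    student_views = global_ids + local_ids   # student sees every crop

    # Assumed DINO-style pairing: compare each teacher view with every
    # student view except the identical crop.
    pairs = [(t, s) for t in teacher_views for s in student_views if s != t]
    return teacher_views, student_views, pairs


teacher_views, student_views, pairs = route_views()
print(len(teacher_views), len(student_views), len(pairs))
```

With the 2 global and 8 local crops from Figure 1, the student processes 10 views and the teacher 2, giving 2 × 10 − 2 = 18 cross-view pairs under this pairing assumption.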