MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging

Mahmoud Soliman; Islam Osman; Mohamed S. Shehata; Rasika Rajapakshe

TL;DR

MedDChest addresses the domain gap between natural-image pre-training and thoracic medical imaging by pre-training a Vision Transformer from scratch on a large, in-domain multimodal thoracic dataset. It introduces Guided Random Resized Crops, a content-guided multi-crop augmentation that focuses learning on anatomically relevant regions, and demonstrates that in-domain pre-training yields superior representations for thoracic tasks. The approach reports state-of-the-art linear probing performance on pneumonia detection (AUROC 99.8%) and ChestX-ray14 classification (accuracy 94.5%), outperforming ImageNet-pretrained baselines and MedMAE. By releasing the model weights, it aims to provide a robust foundation for future thoracic AI research and clinical translation.

Abstract

The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model's effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.
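The abstract does not spell out how Guided Random Resized Crops selects regions, so the following is only a minimal sketch of the general idea of content-biased crop sampling, not the authors' implementation. It assumes a single-channel image scaled to [0, 1], treats pixels above an intensity threshold as "content", and uses rejection sampling to prefer crops that cover enough of that content. The function name, the `min_content`, `threshold`, and `max_tries` parameters, and the thresholding rule are all hypothetical choices made for illustration.

```python
import numpy as np


def guided_random_resized_crop(image, rng, scale=(0.2, 1.0), ratio=(3 / 4, 4 / 3),
                               min_content=0.5, threshold=0.1, max_tries=20):
    """Return a crop box (top, left, height, width) biased toward content.

    Illustrative assumption: "content" means pixels brighter than `threshold`.
    The paper may define anatomical relevance differently.
    """
    h, w = image.shape
    mask = image > threshold  # hypothetical foreground/content mask

    # Fallback: the full image, used if no sampled crop has more content.
    best_box, best_frac = (0, 0, h, w), float(mask.mean())

    for _ in range(max_tries):
        # Sample area and aspect ratio as in a standard random resized crop.
        area = rng.uniform(*scale) * h * w
        aspect = np.exp(rng.uniform(np.log(ratio[0]), np.log(ratio[1])))
        cw = int(round(np.sqrt(area * aspect)))
        ch = int(round(np.sqrt(area / aspect)))
        if not (0 < cw <= w and 0 < ch <= h):
            continue

        top = int(rng.integers(0, h - ch + 1))
        left = int(rng.integers(0, w - cw + 1))

        # Fraction of the candidate crop covered by content.
        frac = float(mask[top:top + ch, left:left + cw].mean())
        if frac >= min_content:
            return (top, left, ch, cw)  # accept: enough content
        if frac > best_frac:
            best_box, best_frac = (top, left, ch, cw), frac

    return best_box  # no candidate passed; use the most content-rich one seen
```

The function returns only the crop box; resizing the cropped region to the model's input resolution is left to the caller. A plain random resized crop corresponds to setting `min_content=0.0`, in which case the first valid candidate is always accepted.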
