Exploring scalable medical image encoders beyond text supervision

Fernando Pérez-García, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Matthew P. Lungren, Maria Teodora Wetscherek, Noel Codella, Stephanie L. Hyland, Javier Alvarez-Valle, Ozan Oktay

TL;DR

Rad-DINO shows that high-quality biomedical image encoders can be trained purely from unimodal imaging data without text supervision. It adapts the DINOv2 framework with masked image modelling to learn global and local features at scale across chest X-ray datasets. Across classification, segmentation, and radiology report generation benchmarks, Rad-DINO matches or surpasses state-of-the-art language-supervised models, and its features correlate more strongly with other patient record information (e.g., sex or age) than those of language-supervised encoders. The results support a scalable, modality-decoupled pathway for foundational biomedical image encoders and suggest broad applicability to other medical imaging domains.

Abstract

Language-supervised pre-training has proven to be a valuable method for extracting semantically meaningful features from images, serving as a foundational element in multimodal systems within the computer vision and medical imaging domains. However, the computed features are limited by the information contained in the text, which is particularly problematic in medical imaging, where the findings described by radiologists focus on specific observations. This challenge is compounded by the scarcity of paired imaging-text data due to concerns over leakage of personal health information. In this work, we fundamentally challenge the prevailing reliance on language supervision for learning general-purpose biomedical imaging encoders. We introduce RAD-DINO, a biomedical image encoder pre-trained solely on unimodal biomedical imaging data that obtains similar or greater performance than state-of-the-art biomedical language-supervised models on a diverse range of benchmarks. Specifically, the quality of learned representations is evaluated on standard imaging tasks (classification and semantic segmentation), and a vision-language alignment task (text report generation from images). To further demonstrate the drawback of language supervision, we show that features from RAD-DINO correlate better than those from language-supervised models with other medical records (e.g., sex or age), which are generally not mentioned in radiology reports. Finally, we conduct a series of ablations determining the factors in RAD-DINO's performance; notably, we observe that RAD-DINO's downstream performance scales well with the quantity and diversity of training data, demonstrating that image-only supervision is a scalable approach for training a foundational biomedical image encoder. Model weights of RAD-DINO trained on publicly available datasets are available at https://huggingface.co/microsoft/rad-dino.

Paper Structure (61 sections, 10 figures, 12 tables)

Figures (10)

  • Figure 1: Rad-DINO overview. (a) Model architecture highlighting the training process using image-level and patch-level objectives, and pre-trained Rad-DINO encoder applied on downstream tasks by training task-specific heads. (b) Summary of pre-training and evaluation datasets. (c) Summary of results for image classification (\ref{tab:classification_benchmark_vindr}, \ref{tab:classification_benchmarks_candid_ptx}), semantic segmentation (\ref{tab:segmentation_benchmarks}) and report generation (\ref{tab:findings_generation}) downstream tasks. Rad-DINO (L) and Rad-DINO (U) refer to linear and UPerNet decoder segmentation heads, respectively.
  • Figure 2: Visual token embedding similarities between pairs of CXR images, computed with Rad-DINO, are shown with respect to a token marked on each query image with a circle. The two manually-picked query tokens (in yellow, left, and purple, right) highlight consolidation and a lung nodule, respectively. For each query token, its similarity to the token embeddings of the target image is highlighted in yellow and is proportional to the heatmap brightness. Rad-DINO can match findings across images from different subjects, thanks to the features learnt during SSL training.
  • Figure B.1: Linear probing results on VinDr-CXR vs. input image resolution, where each given resolution is used for pre-training and inference. This demonstrates that, particularly for large-scale findings, the superior performance of Rad-DINO is not driven by its capability to encode higher resolution inputs. Data is presented as mean $\pm$ standard deviation.
  • Figure B.2: Linear probing performance on VinDr-CXR vs. number of training images used in Rad-DINO pre-training. Data is presented as mean $\pm$ standard deviation.
  • Figure C.1: Linear probing results for pneumothorax and chest tubes obtained on the CANDID-PTX dataset \cite{feng2021curation}, for different image resolutions. Both pre-training and inference settings are adapted for the given input resolution. Data is presented as mean $\pm$ standard deviation.
  • ...and 5 more figures
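
The token-matching idea described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' code: the caption only says "similarity", so the use of cosine similarity is an assumption, and the function name `token_similarity_map`, the 4×4 patch grid, the embedding size of 8, and the random arrays standing in for encoder outputs are all hypothetical.

```python
import numpy as np


def token_similarity_map(query_tokens, target_tokens, query_index, grid_shape):
    """Similarity of one query patch token to every patch token of a target image.

    query_tokens, target_tokens: arrays of shape (num_patches, dim).
    Cosine similarity is an assumption; the caption does not name the metric.
    """
    q = query_tokens[query_index]
    q = q / np.linalg.norm(q)
    t = target_tokens / np.linalg.norm(target_tokens, axis=1, keepdims=True)
    sims = t @ q  # one cosine similarity per target patch
    return sims.reshape(grid_shape)  # lay the scores out on the patch grid


# Random stand-ins for patch embeddings (hypothetical sizes, not model outputs).
rng = np.random.default_rng(0)
grid_shape = (4, 4)
dim = 8
query_tokens = rng.normal(size=(16, dim))
target_tokens = rng.normal(size=(16, dim))

# Plant a matching "finding": target patch 5 points the same way as query patch 10.
target_tokens[5] = 3.0 * query_tokens[10]

heatmap = token_similarity_map(query_tokens, target_tokens, 10, grid_shape)
best = np.unravel_index(np.argmax(heatmap), grid_shape)
print("brightest patch (row, col):", best)
```

With real images, the two token arrays would be the patch embeddings produced by the encoder for the query and target images, and the resulting grid would be upsampled and overlaid on the target image as the heatmap.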