Advancing human-centric AI for robust X-ray analysis through holistic self-supervised learning

Théo Moutakanni, Piotr Bojanowski, Guillaume Chassagnon, Céline Hudelot, Armand Joulin, Yann LeCun, Matthew Muckley, Maxime Oquab, Marie-Pierre Revel, Maria Vakalopoulou

TL;DR

RayDINO addresses the need for robust, fair, and holistic chest X-ray analysis using a large self-supervised vision transformer trained on 873k images. By freezing the 307M-parameter backbone and training lightweight task adapters, the approach delivers state-of-the-art performance across 21 benchmarks spanning classification (AUROC), segmentation (mDice), and radiology report generation, while enabling strong out-of-domain generalization and bias auditing. The work highlights the advantages of self-supervised pretraining for patient-centric AI, offering interpretable attention maps and consistent performance on unseen populations and new diseases like COVID-19. Its demonstrated cross-population generalization, fairness analysis, and clinical applicability suggest significant potential for scalable radiology support in diverse settings, including low-resource regions and nonstandard exam distributions. Overall, RayDINO advances robust, versatile radiology AI by combining holistic imaging representations with minimal task-specific supervision and explicit interpretability.

Abstract

AI foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We present RayDINO, a large visual encoder trained by self-supervision on 873k chest X-rays. We compare RayDINO to previous state-of-the-art models across nine radiology tasks, from classification and dense segmentation to text generation, and provide an in-depth analysis of the population, age and sex biases of our model. Our findings suggest that self-supervision enables patient-centric AI that is useful in clinical workflows and interprets X-rays holistically. With RayDINO and small task-specific adapters, we reach state-of-the-art results and improve generalization to unseen populations while mitigating bias, illustrating the true promise of foundation models: versatility and robustness.
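
The "frozen encoder plus small task-specific adapter" recipe mentioned in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `frozen_encoder` is a hypothetical two-feature stand-in, not RayDINO; the "images" are short lists of made-up intensities; and the adapter is a plain logistic-regression head, which may differ from the adapters used in the paper. The only point shown is that the encoder has no trainable state and only the head's weights are updated.

```python
import math


def frozen_encoder(image):
    """Hypothetical stand-in for a frozen pretrained backbone (NOT RayDINO).

    It has no trainable parameters: it maps an image (a list of
    intensities) to a fixed two-dimensional feature vector.
    """
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return [mean, spread]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_adapter(features, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression head on precomputed, fixed features."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Only the adapter's parameters change; the encoder is untouched.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, image):
    """Probability of the positive class for a raw image."""
    feats = frozen_encoder(image)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)


# Made-up toy data: low-intensity "normal" vs high-intensity "abnormal".
normal = [[0.10, 0.20, 0.30], [0.15, 0.20, 0.25],
          [0.20, 0.25, 0.30], [0.05, 0.20, 0.35]]
abnormal = [[0.70, 0.80, 0.90], [0.75, 0.80, 0.85],
            [0.80, 0.85, 0.90], [0.65, 0.80, 0.95]]
images = normal + abnormal
labels = [0] * len(normal) + [1] * len(abnormal)

# Features are computed once, since the encoder never changes.
features = [frozen_encoder(img) for img in images]
weights, bias = train_linear_adapter(features, labels)
```

Because the encoder is fixed, its features can be cached and reused across tasks, so each new task only costs the training of a small head.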

Paper Structure (14 sections, 11 figures, 2 tables)

Figures (11)

  • Figure 1: We introduce RayDINO, a foundation model for holistic and fairer analysis of chest X-rays. a) RayDINO is a 307M-parameter vision transformer trained using the DINOv2 self-supervised objectives and applied as-is to all downstream tasks without any modification or specialization of its parameters. b) RayDINO is trained on four different datasets from the USA and Europe, comprising over 870k X-rays, and is tested on eleven datasets from seven countries across four continents in both internal and external settings. c) RayDINO is evaluated on tasks divided into three categories. d) RayDINO significantly outperforms all other models on 21 benchmarks and consistently delivers excellent performance (AUROC for Classification and Fairness, macro Accuracy for Explainability, mDice for Segmentation, AUROC for BRAX Generalization, mDice for PAXRay Generalization, CheXbert vector similarity for Report Generation, Pearson's correlation for Unseen Exam, macro Accuracy for Unseen Disease, AUPRC for Rare Classes). All details about the method and implementation are available in the Methods section.
  • Figure 2: RayDINO's holistic analysis of chest X-rays. a) AUROC comparisons on 4 classification datasets from the USA and Vietnam including 38 different findings. b) mDice comparisons on 4 segmentation datasets from Japan, Vietnam, the USA, China and multiple other countries including 157 categories. c) Qualitative visualisation of RayDINO's predictions on four pneumothorax cases. 1st column: original image. 2nd column: interpretable attention maps showing where the classifier, trained without pixel supervision, is looking. 3rd column: segmentation prediction trained using pixel supervision. 4th column: radiologist ground truth. d) Qualitative results for organ and bone segmentation on the PAXRay++ dataset. e) mDice segmentation comparisons macro-averaged per organs and bones. f) Report generation comparison including a natural language processing metric, a classification-based metric and radiologist-aligned metrics. g) A chest X-ray along with a radiologist report and two reports generated by RayDINO and UNIChest.
  • Figure 3: Generalization evaluation on rare or unseen diseases and on an unseen exam. a) AUPRC comparisons of RayDINO against other models on long-tailed findings grouped by frequency: Common ($freq>10\%$), Uncommon ($1\%<freq<10\%$) and Rare ($freq<1\%$). b) Class distribution on the long-tail evaluations sorted by frequency. c) AUPRC comparisons on catheter malposition prediction. d) Macro Accuracy comparison on Normal vs Pneumothorax vs COVID-19 classification and ROC curves for the COVID-19 disease. e) Pearson's correlation coefficient comparison on Cobb angle regression using spinal exams of patients with scoliosis, and the regression model's multi-head attention maps for explainability.
  • Figure 4: Interpretability and Fairness evaluation of RayDINO between different demographics. a) AUROC comparisons by training and validating a classifier on MIMIC or CheXpert (USA) and testing it on BRAX (Brazil). b) mDice comparisons by training and validating a segmentation head on PAXRay++, a synthetic dataset generated from 2D CT-scan projections (mainly the USA and China), and testing it on the real-world datasets JSRT (Japan) and Vindr Rib (Vietnam). c) AUROC fairness comparisons by training a classifier on one sex-specific split and testing it on the other split of the NIH and CheXpert datasets. p-values are obtained with a Mann-Whitney test to assess performance disparities when training with the same sex versus a different sex than the test set. d) AUROC fairness comparisons by training a classifier on one age-specific split and testing it on the two other splits of the MIMIC datasets. e) Macro Accuracy explainability comparisons by training one classifier for each class of Vindr and checking whether the position receiving the most attention matches the radiologist's bounding box on the test set. f) Attention maps compared with radiologist reports to demonstrate the accurate localization of RayDINO's explainable classifiers (specific finding highlighted in bold in the report).
  • Figure S1: (top) Scaling Curve: we compare the impact of model's size on the aggregated results for multiple benchmarks. We report the best competitor model with a red dashed line as a reference. (bottom) Ablation on Populations from Training Data: we compare the impact of using patients from the USA only versus the USA and Europe when pretraining RayDINO.
  • ...and 6 more figures
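
Two quantities named in the captions above, mDice and the long-tail frequency groups of Figure 3, can be made concrete with a minimal sketch. Several choices here are assumptions rather than the paper's evaluation protocol: segmentations are represented as flat label maps with one class per pixel, a class absent from both prediction and target scores 1.0, and frequencies exactly equal to 10% or 1% (which the caption leaves unassigned) fall into the lower group. The function names are hypothetical.

```python
def dice(pred_mask, target_mask):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks (0/1 lists)."""
    intersection = sum(p * t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    # Assumption: if the class is absent from both masks, count it as perfect.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, classes):
    """mDice: the per-class Dice averaged with equal weight per class."""
    scores = []
    for c in classes:
        pred_mask = [int(p == c) for p in pred_labels]
        target_mask = [int(t == c) for t in target_labels]
        scores.append(dice(pred_mask, target_mask))
    return sum(scores) / len(scores)


def frequency_bucket(freq):
    """Group a finding by its frequency (a fraction in [0, 1]).

    Thresholds follow the Figure 3 caption; boundary handling is an assumption.
    """
    if freq > 0.10:
        return "Common"
    if freq > 0.01:
        return "Uncommon"
    return "Rare"


# Tiny worked example on a 4-pixel label map with classes 0, 1, 2.
predicted = [0, 1, 1, 2]
reference = [0, 1, 2, 2]
score = mean_dice(predicted, reference, classes=[0, 1, 2])
# Per-class Dice: class 0 -> 1, class 1 -> 2/3, class 2 -> 2/3, so score = 7/9.
```

Averaging per class rather than per pixel means that small structures count as much as large ones, which is why a macro-averaged score can differ noticeably from pixel accuracy.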