Multi Anatomy X-Ray Foundation Model

Nishank Singla, Krisztian Koos, Farzin Haddadpour, Amin Honarmandi Shandiz, Lovish Chum, Xiaojian Xu, Qing Jin, Erhan Bas

TL;DR

This work addresses the limited generalization of chest-focused radiology models by introducing XR-0, a multi-anatomy X-ray foundation model trained with self-supervised learning on a large, diverse dataset. Built on a ViT-B backbone with image-level and patch-level objectives, XR-0 is evaluated across 12 datasets and 20 tasks, including retrieval, classification, segmentation, localization, visual grounding, and report generation, achieving state-of-the-art results on multi-anatomy benchmarks. A companion chest-specific model, CXR-0, and a multimodal extension, mXR-0, demonstrate that data diversity and text supervision further boost performance in generative tasks, such as radiology report generation. Overall, the results underscore anatomical diversity as a key driver of robust generalization in radiology AI, enabling scalable and adaptable clinical workflows while highlighting ongoing considerations for fairness and task-specific performance.

Abstract

X-ray imaging is ubiquitous in radiology, yet most existing AI foundation models are limited to chest anatomy and fail to generalize across broader clinical tasks. In this work, we introduce XR-0, a multi-anatomy X-ray foundation model trained with self-supervised learning on a large, private dataset of 1.15 million images spanning diverse anatomical regions. We evaluate it across 12 datasets and 20 downstream tasks, including classification, retrieval, segmentation, localization, visual grounding, and report generation. XR-0 achieves state-of-the-art performance on most multi-anatomy tasks and remains competitive on chest-specific benchmarks. Our results demonstrate that anatomical diversity and supervision are critical for building robust, general-purpose medical vision models, paving the way for scalable and adaptable AI systems in radiology.

Paper Structure

This paper contains 34 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of the XR-0 multi-anatomy pretrained model. (a) The model is pretrained using both image-level and patch-level objectives. It is evaluated on a wide range of downstream tasks, including image retrieval, classification, segmentation, report generation, and visual grounding. (b) Distribution of anatomical regions in the pretraining dataset. A total of 1.15 million images are used after filtering duplicates and low-quality samples. (c) Twelve datasets are used to benchmark model performance across diverse downstream tasks. (d) Result summary. XR-0 demonstrates an average convergence speedup of 4.9× on CheXpert, PTX, and QC linear classification tasks compared to DINOv2.
  • Figure 2: Image retrieval examples from the dXR Anatomy dataset. While all models retrieve relevant images based on the target class, domain-specific models often return results that are more consistent in secondary visual attributes such as texture, orientation, or anatomical presentation.
  • Figure 3: Qualitative segmentation results using the XR-0 model with a UPerNet decoder. Only the decoder is trained; the backbone remains frozen.
  • Figure 4: Fairness evaluation. (a) Results for sex-specific subgroups. (b) Fairness comparison across age group splits.
  • Figure A1: An example of an original unprocessed report and the final processed output used for pretraining the multimodal model mXR-0.
  • ...and 6 more figures