Benchmarking CXR Foundation Models With Publicly Available MIMIC-CXR and NIH-CXR14 Datasets

Jiho Shin, Dominic Marshall, Matthieu Komorowski

TL;DR

This study benchmarked two large-scale chest X-ray embedding models, CXR-Foundation (ELIXR v2.0) and MedImageInsight, on the public MIMIC-CXR and NIH ChestX-ray14 datasets, using a unified preprocessing pipeline and lightweight downstream classifiers. MedImageInsight generally achieved higher task performance, while CXR-Foundation showed robust cross-dataset stability; unsupervised clustering revealed coherent disease-specific structure in the embeddings. The work highlights the importance of standardized evaluation for medical foundation models and provides reproducible baselines for multimodal fusion and clinical integration. The findings suggest that high-dimensional embeddings can support scalable radiology analytics, provided that dataset diversity and generalization are taken into account.

Abstract

Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models, CXR-Foundation (ELIXR v2.0) and MedImageInsight, on the public MIMIC-CXR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings further revealed a coherent disease-specific structure consistent with the quantitative results. The results highlight the need for standardised evaluation of medical foundation models and establish reproducible baselines for future multimodal and clinical integration studies.
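The reporting step described above (AUROC with a 95% confidence interval) can be illustrated with a minimal sketch. The abstract does not say how the intervals were computed; the percentile bootstrap used here is an assumption, and the function names `auroc` and `bootstrap_auroc_ci` are hypothetical helpers written for this illustration. The scores would in practice come from a classifier trained on the embeddings; here they are just a list of floats.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    labels: list of 0/1 ints; scores: list of floats (higher = more positive).
    Tied scores receive the average of their ranks.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pairs[i:j])
        i = j
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap CI (assumed method, see lead-in)."""
    point = auroc(labels, scores)  # also rejects single-class input up front
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[k] for k in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[k] for k in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi
```

Calling `bootstrap_auroc_ci(y_true, y_score)` once per disease label returns a `(point, lower, upper)` triple; averaging the point estimates over labels would give a mean AUROC of the kind the abstract reports. The same resampling loop could wrap an F1 computation, given a decision threshold.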

Paper Structure

This paper contains 12 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: UMAP visualisation of MedImageInsight embeddings for the highest- and lowest-performing disease labels on the MIMIC dataset: (a) Effusion shows distinct separation between positive and negative samples, consistent with its high AUROC, while (b) Opacity displays mixed clustering, reflecting lower discriminative power. $0$ indicates absence and $1$ indicates presence of the disease label.