A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez, Faidra Patsatzi, Jiaming Liu, Jin Long, Jean-Benoit Delbrouck, Sergios Gatidis, Akshay S. Chaudhari, Curtis P. Langlotz

TL;DR

Active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. The work also offers practical insights into the data and computation demands of efficient pretraining and downstream adaptation for medical vision-language foundation models.

Abstract

Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training that ignores heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports and consumes under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and cross-modal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.
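The curation idea summarized in the abstract (and described in the Figure 1 caption as giving higher priority to samples farther from a set of learnable prototypes) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function name `curate_by_prototype_distance`, the use of fixed rather than learned prototypes, Euclidean distance, and sampling probabilities proportional to the nearest-prototype distance.

```python
import numpy as np


def curate_by_prototype_distance(embeddings, prototypes, keep_fraction, rng):
    """Sample a subset that favours points far from their nearest prototype.

    Hypothetical helper: the paper's actual scoring and sampling rule
    is not specified in this section.
    """
    # Pairwise Euclidean distances, shape (n_samples, n_prototypes).
    diffs = embeddings[:, None, :] - prototypes[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance from each sample to its closest prototype.
    nearest = dists.min(axis=1)
    # Assumption: sampling weight grows linearly with that distance, so
    # redundant samples near a prototype are down-sampled.
    weights = nearest + 1e-12
    probs = weights / weights.sum()
    n_keep = max(1, int(round(keep_fraction * len(embeddings))))
    selected = rng.choice(len(embeddings), size=n_keep, replace=False, p=probs)
    return np.sort(selected), nearest


# Toy data: 900 "common" samples near the prototype, 100 "rare" ones far away.
rng = np.random.default_rng(0)
common = rng.normal(0.0, 0.1, size=(900, 8))
rare = rng.normal(5.0, 0.1, size=(100, 8))
embeddings = np.vstack([common, rare])
prototypes = np.zeros((1, 8))  # stand-in for learned prototypes

selected, nearest = curate_by_prototype_distance(embeddings, prototypes, 0.227, rng)
rare_share = float(np.mean(selected >= 900))
print(f"kept {len(selected)} of {len(embeddings)}; rare share in subset: {rare_share:.2f}")
```

In this toy setup the rare group makes up 10% of the full set, so a rare share above 0.10 in the kept subset indicates that the distance-based weighting is enriching under-represented samples, which is the qualitative behaviour the abstract attributes to CheXficient.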

Paper Structure (37 sections, 21 figures, 4 tables, 1 algorithm)

Figures (21)

  • Figure 1: Overview of this study. (a) CheXficient pretraining strategy. Paired chest X-ray (CXR) images and radiology reports are employed. A data curator selectively determines which image–report pairs are included in training, followed by optimization with an InfoNCE contrastive loss [oord2018representation]. (b) Data curation mechanism. The data curator prioritizes image–report pairs based on their proximity to a set of learnable prototypes. Samples farther from the prototypes are assigned higher priority, while more redundant samples closer to the prototypes are down-sampled. (c) Evaluation protocol. After pretraining, CheXficient is evaluated on non-adapted tasks without any architectural or weight modifications, including (i) zero-shot findings classification and (ii) zero-shot cross-modal retrieval. CheXficient can also be adapted to downstream tasks via fine-tuning, including (iii) multi-class disease prediction, (iv) semantic segmentation, and (v) radiology report generation. (d) Data efficiency analysis. Zero-shot findings classification performance (average AUROC on 8 datasets) versus pretraining data size, compared with alternative pretraining strategies and models. (e) Overall performance. CheXficient performs better than or comparably to full-data pretraining, while greatly outperforming other existing large-scale pretrained models on diverse evaluation benchmarks spanning 5 task types: non-adapted and adapted classification (AUROC), retrieval (Recall@1), segmentation (Dice score), and report generation (RadGraph).
  • Figure 1: Performance (AUPRC) on zero-shot findings classification. We evaluate the pretrained models on 8 public datasets. Among them, SIIM-PTX, Pneumonia2017, and TBX11K are from external domains unseen in pretraining, while the remaining 5 datasets are used for internal evaluation. Compared to CheXfull, CheXficient achieves higher AUROC on 3 datasets ($p < 0.05$), and comparable performance on 5 datasets ($p > 0.05$). For each dataset, CheXficient outperforms CheXrandom by large margins. We present the mean $\pm$ 95% confidence interval (CI) of AUPRC for each model. The listed $p$-values are calculated using two-sided $t$-tests.
  • Figure 2: (a) Feature distribution of training samples from the full set (CheXfull, $n$ = 1,235K), random subset (CheXrandom, $n$ = 280K), and curated subset (CheXficient, $n$ = 280K). Qualitative histograms of feature embeddings projected onto a two-dimensional PCA space. Colorbars are independently normalized to reveal the structural organization of the PCA space; absolute density values are therefore not directly comparable across methods. The curated subset occupies distinct and long-tailed regions in the feature space, compared with both the full and random subsets. (b) Quantitative analysis of local feature density (left) for samples from the full set, random subset, and curated subset, measured by the average $k$-nearest neighbor (kNN) distance ($k=20$) computed in the raw feature space. $p$-values are reported from Welch's $t$-test, comparing each subset against the full set. The low-density proportion (right) denotes the fraction of samples falling within the 25% lowest-density regions of the full dataset. (c) CDF plot of the average kNN distance. Curated samples (red) are shifted toward higher kNN distances, indicating enrichment in low-density regions, whereas the random subset (purple) closely overlaps with the full set (yellow).
  • Figure 2: Performance on zero-shot cross-modal retrieval (Image $\rightarrow$ Findings and Findings $\rightarrow$ Image). We evaluate the pretrained models on the public CheXpert and MIMIC-CXR benchmarks. CheXficient achieves performance comparable to CheXfull on both datasets ($p > 0.05$), while outperforming CheXrandom. We report the mean $\pm$ 95% CI of Recall@1, i.e., the rate at which the exact paired CXR Findings section (or image) is retrieved as the top-1 result for a given CXR image (or Findings section).
  • Figure 3: Performance on zero-shot findings classification. We evaluate the pretrained models on 8 public datasets. Among them, SIIM-PTX, Pneumonia2017, and TBX11K are from external domains unseen in pretraining, while the remaining 5 datasets are used for internal evaluation. Compared to CheXfull, CheXficient achieves higher AUROC on three datasets ($p < 0.05$), comparable performance on four datasets ($p > 0.05$), and lower performance on one dataset ($p < 0.05$). Across all datasets, CheXficient achieves higher AUROC than CheXrandom. We present the mean $\pm$ 95% confidence interval (CI) of AUROC for each model. The listed $p$-values are computed using two-sided $t$-tests.
  • ...and 16 more figures
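
The local-density analysis described in the Figure 2 caption (average kNN distance with $k=20$, and the share of a subset that falls in the lowest-density quarter of the full set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `mean_knn_distance` and `low_density_proportion` are hypothetical, Euclidean distance is assumed, neighbors are searched in the full set, and "lowest density" is taken to mean the largest 25% of average kNN distances.

```python
import numpy as np


def mean_knn_distance(queries, reference, k=20, exclude_self=False):
    """Average Euclidean distance from each query to its k nearest reference points."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (
        np.sum(queries ** 2, axis=1)[:, None]
        + np.sum(reference ** 2, axis=1)[None, :]
        - 2.0 * queries @ reference.T
    )
    dists = np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from round-off
    # When queries and reference are the same array, skip the zero self-distance.
    start = 1 if exclude_self else 0
    nearest = np.sort(dists, axis=1)[:, start:start + k]
    return nearest.mean(axis=1)


def low_density_proportion(subset_scores, full_scores, top_fraction=0.25):
    """Fraction of a subset whose kNN distance is in the top `top_fraction` of the full set.

    Assumption: a larger average kNN distance means lower local density.
    """
    threshold = np.quantile(full_scores, 1.0 - top_fraction)
    return float(np.mean(subset_scores >= threshold))


# Toy data: a dense cluster plus a small, widely spread tail.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(950, 8))
sparse = rng.normal(0.0, 3.0, size=(50, 8))
full = np.vstack([dense, sparse])

full_scores = mean_knn_distance(full, full, k=20, exclude_self=True)
random_idx = rng.choice(len(full), size=200, replace=False)
tail_idx = np.arange(950, 1000)  # stand-in for a curated, tail-heavy subset

print("full set:     ", low_density_proportion(full_scores, full_scores))
print("random subset:", low_density_proportion(full_scores[random_idx], full_scores))
print("tail subset:  ", low_density_proportion(full_scores[tail_idx], full_scores))
```

By construction the full set scores close to 0.25 on this measure, so a subset with a clearly higher value is enriched in low-density regions, which is how the caption interprets the curated subset.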