Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context
Andrew Wang, Jiashuo Zhang, Michael Oberst
TL;DR
This study questions whether state-of-the-art chest X-ray (CXR) models truly offer clinical diagnostic value beyond contextual priors. By deriving a pre-test probability from prior discharge summaries and using it to stratify, match, and reweight evaluations, the authors show that model performance is lower when prior context is strong and degrades when that context is decoupled from the current CXR label. They demonstrate that prior notes can predict future CXR labels, and that vision-model performance varies across context-defined subpopulations and between cases with and without prior mentions. The findings argue for context-aware evaluation pipelines that reveal a model's true visual diagnostic contribution and better align model assessment with clinical decision-making in longitudinal health-record datasets.
Abstract
Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the reported strong average-case performance of these models does not necessarily reflect their actual utility when used in heterogeneous clinical settings, potentially masking weaker performance in medically significant scenarios. In this work, we use clinical context to provide a more holistic evaluation of current "state-of-the-art" (SOTA) models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a "pre-CXR" probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions. First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation between prior context and the current CXR label is broken, suggesting that model performance may depend in part on inference of pre-CXR clinical context.
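The stratified analysis described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `auroc` and `stratified_auroc`, the bin edges, and the toy data are all hypothetical, chosen only to show how AUROC might be computed separately within strata of pre-CXR probability.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUROC is undefined unless both classes are present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_auroc(labels, model_scores, pre_cxr_probs, edges=(0.2, 0.5)):
    """Bin cases by pre-CXR probability, then compute AUROC within each bin.

    `edges` are illustrative cut points (an assumption, not from the paper).
    Bin 0 holds the lowest pre-CXR probabilities; empty bins are omitted.
    """
    groups = defaultdict(lambda: ([], []))
    for y, s, p in zip(labels, model_scores, pre_cxr_probs):
        bin_index = sum(p >= e for e in edges)
        groups[bin_index][0].append(y)
        groups[bin_index][1].append(s)
    return {b: auroc(ys, ss) for b, (ys, ss) in sorted(groups.items())}


# Toy data, invented purely for illustration (not results from the study).
labels = [0, 0, 1, 1, 0, 1, 0, 1]
model_scores = [0.1, 0.2, 0.8, 0.9, 0.6, 0.5, 0.4, 0.7]
pre_cxr_probs = [0.05, 0.1, 0.1, 0.15, 0.6, 0.7, 0.8, 0.9]

result = stratified_auroc(labels, model_scores, pre_cxr_probs)
print(result)
```

In this toy example the low pre-CXR stratum (bin 0) is perfectly separated while the high stratum (bin 2) is not, which mirrors the kind of gap a stratified evaluation is designed to surface. A pooled AUROC over all eight cases would hide that difference.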
