Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context

Andrew Wang, Jiashuo Zhang, Michael Oberst

TL;DR

This study questions whether state‑of‑the‑art chest X‑ray (CXR) models truly offer clinical diagnostic value beyond contextual priors. By deriving a pre‑test probability from prior discharge summaries and stratifying, matching, and reweighting evaluations, the authors show that model performance degrades when contextual information is strong or decoupled from the current image. They demonstrate that prior notes can predict future CXR labels and that vision-model performance varies across context-defined subpopulations and between cases with and without prior mentions. The findings argue for context‑aware evaluation pipelines to reveal the true visual diagnostic contribution and to better align model assessment with clinical decision-making in longitudinal health records datasets.

Abstract

Public datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing machine learning (ML) computer vision models in healthcare. However, the reported strong average-case performance of these models does not necessarily reflect their actual utility when used in heterogeneous clinical settings, potentially masking weaker performance in medically significant scenarios. In this work we use clinical context to provide a more holistic evaluation of current "state-of-the-art" (SOTA) models for CXR diagnosis. In particular, we use discharge summaries, recorded prior to each CXR, to derive a "pre-CXR" probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. We use this measure to probe model performance along two dimensions: First, using a stratified analysis, we show that models tend to have lower performance (as measured by AUROC and other metrics) among individuals with higher pre-CXR probability. Second, by controlling for pre-CXR probability via matching and re-weighting, we demonstrate that performance degrades when the correlation is broken between prior context and the current CXR label, suggesting that model performance may depend in part on inference of pre-CXR clinical context.
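The stratified analysis described in the abstract can be illustrated with a minimal sketch. It assumes each study comes with a binary label, a vision-model score, and a pre-CXR probability; the function names `auroc` and `stratified_auroc` and the equal-size grouping are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def stratified_auroc(
    labels: Sequence[int],
    scores: Sequence[float],
    pretest: Sequence[float],
    n_groups: int = 4,
) -> List[float]:
    """Sort studies by pre-CXR probability, split them into n_groups
    equal-size groups (hypothetical choice), and compute AUROC per group,
    ordered from lowest to highest pre-CXR probability.
    """
    order = sorted(range(len(labels)), key=lambda i: pretest[i])
    results = []
    for g in range(n_groups):
        lo = g * len(order) // n_groups
        hi = (g + 1) * len(order) // n_groups
        idx = order[lo:hi]
        results.append(
            auroc([labels[i] for i in idx], [scores[i] for i in idx])
        )
    return results
```

Comparing the first and last entries of the returned list corresponds to contrasting the lowest and highest pre-CXR probability groups. Confidence intervals (for example via bootstrapping) would have to be added separately.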

Paper Structure

This paper contains 41 sections, 2 theorems, 27 equations, 8 figures, 35 tables.

Key Result

Proposition 3.1

For an observation $(x, y, c)$, let the importance weight $w(y, c)$ be defined as the ratio of the marginal disease prevalence to the context-conditional probability:

$$w(y, c) = \frac{P(Y = y)}{P(Y = y \mid C = c)}.$$

Then, given a distribution $P$ and a corresponding distribution $Q$ (as defined in Definition 3.1), where $Q(X, Y, C) > 0 \implies P(X, Y, C) > 0$, for any measurable function $g(X, Y, C)$:

$$\mathbb{E}_{Q}\left[g(X, Y, C)\right] = \mathbb{E}_{P}\left[w(Y, C)\, g(X, Y, C)\right].$$
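As a rough illustration of the weight described in this proposition (marginal prevalence divided by the context-conditional probability of the observed label), the sketch below computes per-example weights for a binary label and plugs them into a pairwise weighted AUROC. The function names, the binary-label simplification, and the choice of a weighted AUROC estimator are assumptions for illustration and may differ from the authors' estimator.

```python
from typing import Sequence


def importance_weight(y: int, pretest_prob: float, prevalence: float) -> float:
    """Weight = P(Y=y) / P(Y=y | C=c) for a binary label y.

    pretest_prob is an estimate of P(Y=1 | C=c); prevalence estimates P(Y=1).
    """
    if not 0.0 < pretest_prob < 1.0:
        raise ValueError("pretest_prob must lie strictly between 0 and 1")
    if y == 1:
        return prevalence / pretest_prob
    return (1.0 - prevalence) / (1.0 - pretest_prob)


def weighted_auroc(
    labels: Sequence[int],
    scores: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Pairwise AUROC in which each positive/negative pair counts with the
    product of its two example weights (ties count as half a win)."""
    pos = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 1]
    neg = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    num = sum(
        wp * wn * (1.0 if sp > sn else 0.5 if sp == sn else 0.0)
        for sp, wp in pos
        for sn, wn in neg
    )
    den = sum(wp for _, wp in pos) * sum(wn for _, wn in neg)
    return num / den
```

After reweighting, the weighted share of positives within each context group equals the overall prevalence, which is one way to check that the label has been decoupled from the context. With all weights equal to 1, `weighted_auroc` reduces to the ordinary pairwise AUROC.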

Figures (8)

  • Figure 1: Overview of our evaluation framework. (a): We use clinical context contained within discharge summaries to derive a pre-test probability estimate of disease risk, given knowledge obtained before a CXR is ordered. (b): We identify subpopulations of the evaluation set using context from prior notes (such as pre-test probability, or prior mention of the disease label). We then evaluate performance on the resulting disjoint groups. (c): We create an evaluation set by matching positive/negative image pairs with similar context-derived pre-test probabilities. We then compare vision model performance on this balanced test set versus the original test set. (d): We statistically decouple the image label from the contextual information by reweighting the evaluation set. By isolating the label from potentially predictive contextual information, we obtain the vision model's true diagnostic ability using visual features that are not redundant with context.
  • Figure 2: Visualization of "prior clinical context" used in this paper. For a given CXR study, we use all discharge summaries from admissions that occur before the current admission (blue). We do not use any information (discharge summaries, radiology reports, or otherwise) from the current or any future admission (red).
  • Figure 3: Details of Matched Neighbors vs. Reweighted Evaluation Set. (a): Original evaluation set: Negatively and positively labeled CXRs in the original evaluation set with their associated distributions of pre-test probabilities. All images and their associated pre-test probabilities are equally weighted. (b): Matched Neighbors set: Our procedure matches positive and negative examples 1-to-1 based on pre-test probability. In practice, given the low prevalence of positive labels, this procedure tends to retain the subset of negatively labeled CXRs whose associated pre-test probability distribution more closely resembles that of the positively labeled CXRs, which are unchanged. (c): Reweighted set: After reweighting of images and their associated pre-test probabilities, the weighted distribution of pre-test probabilities becomes comparable across the positive and negative reweighted populations.
  • Figure 4: Held-out performance (in terms of AUROC) of CXR models across sub-populations stratified by pre-test probability (LM embeddings) of the CXR label. Vision model performance generally degrades across most labels as the prior probability of the label increases. Labels marked with an asterisk (*) indicate a statistically significant difference in AUROC between the Bottom 25% and Top 25% groups. The 95% confidence intervals in parentheses were calculated using percentile bootstrapping as described in the Stratified and Matched Analyses subsection.
  • Figure 5: Comparison of AUROC across Standard, Reweighted (IPW), and Matched Neighbor settings. Metrics reported are mean (95% CI). Labels marked with a dagger ($\dagger$) and an asterisk (*) indicate statistically significant differences in the Reweighted (IPW) and Matched settings, respectively, compared to the Standard setting.
  • ...and 3 more figures
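
The 1-to-1 matching on pre-test probability described in the Figure 3 caption could be sketched as follows. The captions do not spell out the exact matching algorithm, so this uses a greedy nearest-neighbor rule without replacement as a stand-in; the function name `match_neighbors` and the optional `caliper` threshold are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def match_neighbors(
    labels: Sequence[int],
    pretest: Sequence[float],
    caliper: Optional[float] = None,
) -> List[Tuple[int, int]]:
    """Greedy 1-to-1 matching (assumed rule, not the paper's confirmed one).

    Each positive example is paired with the unused negative example whose
    pre-test probability is closest. If caliper is given, pairs whose
    probabilities differ by more than caliper are skipped.
    Returns (positive_index, negative_index) pairs.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    unused = {i for i, y in enumerate(labels) if y == 0}
    pairs: List[Tuple[int, int]] = []
    for p in positives:
        if not unused:
            break  # no negatives left to match
        # Break distance ties by index so the result is deterministic.
        best = min(unused, key=lambda n: (abs(pretest[n] - pretest[p]), n))
        if caliper is None or abs(pretest[best] - pretest[p]) <= caliper:
            pairs.append((p, best))
            unused.remove(best)
    return pairs
```

The matched evaluation set then consists of the indices appearing in the returned pairs. Because positives are typically rarer than negatives, with no caliper every positive is kept and only a subset of negatives is retained, in line with the caption's remark that the positively labeled CXRs are unchanged.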

Theorems & Definitions (6)

  • Definition 3.1: Reweighting Distributions
  • Proposition 3.1: Expected Equivalence
  • Lemma E.1
  • proof
  • proof
  • proof