Source Matters: Source Dataset Impact on Model Robustness in Medical Imaging

Dovile Juodelyte; Yucheng Lu; Amelia Jiménez-Sánchez; Sabrina Bottazzi; Enzo Ferrante; Veronika Cheplygina

TL;DR

This paper tackles how the pretraining source dataset domain affects generalization in medical imaging, arguing that cross-domain transfer can foster shortcut learning. It introduces MICCAT, a taxonomy of contextualized confounders, and a principled experimental design that compares ImageNet versus RadImageNet under controlled confounders on chest X-ray and CT tasks. The key finding is that RadImageNet matches ImageNet in i.i.d. performance but is more robust to out-of-distribution confounders, indicating that source domain selection critically shapes robustness beyond accuracy. The work advocates for confounder-aware evaluation of transfer learning in clinical settings and provides public code to enable broader, rigorous assessments of model robustness.

Abstract

Transfer learning has become an essential part of medical imaging classification algorithms, often leveraging ImageNet weights. The domain shift from natural to medical images has prompted alternatives such as RadImageNet, often showing comparable classification performance. However, it remains unclear whether the performance gains from transfer learning stem from improved generalization or shortcut learning. To address this, we conceptualize confounders by introducing the Medical Imaging Contextualized Confounder Taxonomy (MICCAT) and investigate a range of confounders across it -- whether synthetic or sampled from the data -- using two public chest X-ray and CT datasets. We show that ImageNet and RadImageNet achieve comparable classification performance, yet ImageNet is much more prone to overfitting to confounders. We recommend that researchers using ImageNet-pretrained models reexamine their model robustness by conducting similar experiments. Our code and experiments are available at https://github.com/DovileDo/source-matters.

Paper Structure (8 sections, 2 equations, 4 figures, 1 table)

Introduction
Method
MICCAT: towards a standardized taxonomy for medical imaging confounders
Experimental Design
Results and Discussion
Conclusion
Acknowledgments.
Disclosure of Interests.

Figures (4)

Figure 1: MICCAT: Medical Imaging Contextualized Confounder Taxonomy. Instances of confounders investigated in this paper are highlighted in bold.
Figure 2: Synthetic artifacts: (a) a tag with a red arrow for reference, (b) a zoomed-in view of the original image, (c) denoising by a low-pass filter with cutoff frequency $D_0 = 200$px (see Eq. \ref{eq:low}), and (d) Poisson noise with $N_{0} = 2 \times 10^{6}$ (see Eq. \ref{eq:noise}). The parameters shown here are chosen to emphasize subtle local variations, such as the smoothing effect of the low-pass filter and the graininess introduced by the Poisson noise. For our experiments, we use $D_0 = 500$px and $N_{0} = 2 \times 10^{7}$, which produce imperceptible changes.
Figure 3: Mean AUC across five-fold cross-validation with 95% CI for lung mass (left and middle) and atelectasis (right) prediction in chest X-rays. Increasing the correlation between the artifact (tag, denoising, gender) and the label leads to lower o.o.d. AUC (top row; evaluated on the o.o.d. test set described in Sec. \ref{sec:design}), while i.i.d. AUC increases (bottom row). RadImageNet pretraining shows less degradation in o.o.d. AUC than ImageNet pretraining, suggesting that ImageNet may over-rely on spurious correlations in the target dataset. The grey dotted line is the SOTA result for lung mass and atelectasis in NIH CXR14 reported by \cite{dai2024unichest}.
Figure 4: O.o.d. AUC (mean and 95% CI across five folds) for lung mass prediction in chest X-rays and CTs. In X-rays (top), ImageNet and RadImageNet show similar reliance on Poisson noise. In CT scans (bottom), however, RadImageNet is more robust. When the confounder is high vs. low noise (right), both ImageNet and RadImageNet are less sensitive than when it is noise vs. no noise (left).
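
The captions above describe tying a synthetic artifact to the label with increasing correlation. The following is a minimal sketch of that idea, not the authors' implementation: the function names `artifact_mask` and `add_tag`, the parameter `p`, and the specific sampling scheme (a fraction `p` of positives and `1 - p` of negatives receive the artifact) are all assumptions made for illustration. The tag is simplified to a bright square in the image corner.

```python
import numpy as np


def artifact_mask(labels, p, rng):
    """Choose which samples receive the artifact (hypothetical scheme).

    A fraction `p` of positive samples and a fraction `1 - p` of negative
    samples are marked. p = 0.5 gives no association between artifact and
    label; p = 1.0 makes the artifact a perfect shortcut for the label.
    """
    labels = np.asarray(labels).astype(bool)
    mask = np.zeros(labels.shape, dtype=bool)
    for cls, frac in ((True, p), (False, 1.0 - p)):
        idx = np.flatnonzero(labels == cls)
        k = int(round(frac * idx.size))
        if k > 0:
            chosen = rng.choice(idx, size=k, replace=False)
            mask[chosen] = True
    return mask


def add_tag(image, size=8, value=1.0):
    """Return a copy of `image` with a square tag in the top-left corner.

    This is a stand-in for the tag artifact; position, size, and intensity
    are illustrative assumptions.
    """
    out = image.copy()
    out[:size, :size] = value
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([1] * 10 + [0] * 10)
    images = np.zeros((20, 32, 32))

    # Training-style split: artifact mostly co-occurs with the positive class.
    train_mask = artifact_mask(labels, p=0.8, rng=rng)
    confounded = np.stack(
        [add_tag(img) if m else img for img, m in zip(images, train_mask)]
    )
    print(confounded.shape, int(train_mask.sum()))
```

Under this scheme, one way to approximate an o.o.d. test set is to call `artifact_mask` with `1 - p`, so that the artifact-label association seen in training is reversed. The experimental design section of the paper defines the actual split.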
