SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification

Jong Bum Won, Wesley De Neve, Joris Vankerschaver, Utku Ozbulak

TL;DR

SpurBreast addresses spurious correlations in breast MRI classification by providing a curated dataset with deliberately introduced biases, notably magnetic field strength and vertical orientation. It demonstrates that DNNs can exploit non-clinical cues to achieve high validation performance while failing to generalize to unbiased test data. The study uses two architectures (ResNet-50 and ViT-B/16) and controlled data splits to quantify the impact of the biased features, and it offers both biased and unbiased benchmarks for evaluation. The dataset and code are publicly available to support research on uncertainty estimation and bias mitigation for medical imaging models.

Abstract

Deep neural networks (DNNs) have demonstrated remarkable success in medical imaging, yet their real-world deployment remains challenging due to spurious correlations, where models can learn non-clinical features instead of meaningful medical patterns. Existing medical imaging datasets are not designed to systematically study this issue, largely due to restrictive licensing and limited supplementary patient data. To address this gap, we introduce SpurBreast, a curated breast MRI dataset that intentionally incorporates spurious correlations to evaluate their impact on model performance. Analyzing over 100 features involving patient, device, and imaging protocol, we identify two dominant spurious signals: magnetic field strength (a global feature influencing the entire image) and image orientation (a local feature affecting spatial alignment). Through controlled dataset splits, we demonstrate that DNNs can exploit these non-clinical signals, achieving high validation accuracy while failing to generalize to unbiased test data. Alongside these two datasets containing spurious correlations, we also provide benchmark datasets without spurious correlations, allowing researchers to systematically investigate clinically relevant and irrelevant features, uncertainty estimation, adversarial robustness, and generalization strategies. Models and datasets are available at https://github.com/utkuozbulak/spurbreast.

Paper Structure

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: (a) A side-profile diagram of the breast, highlighting the imaging region. Slices in the red area contain MRI images with breast tumors, slices in the yellow area are buffer zones and are not used, and slices in the white region do not contain invasive breast tumors. (b) Example MRI slices obtained from the specified cross-sectional region. The red box in one slice marks an invasive breast tumor.
  • Figure 2: Illustration of the dataset creation process for discovering spurious correlations. (a) A typical patient-based random sampling approach, where the dataset is split into training, validation, and test sets to prevent overlap and ensure unbiased evaluations. (b) A modified sampling strategy where specific spurious correlations between predictive labels (tumor-positive and tumor-negative) and supplementary features (e.g., ethnicity) are deliberately introduced to study their effects on model performance.
  • Figure 3: Example breast MRI images obtained using (a) 1.5T and (b) 3T devices.
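
The two sampling strategies described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' released code: the record fields (patient_id, field_strength, label), the function names patient_split and biased_split, the split fractions, and the synthetic records are all hypothetical, chosen only to show how a patient-level split can carry a label/feature correlation in the training and validation sets while the test set is left unfiltered.

```python
import random
from collections import Counter


def patient_split(patient_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Randomly assign each patient to exactly one of train/val/test."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }


def biased_split(records, spurious_value="3T", seed=0):
    """Patient-level split with a label/feature correlation in train and val only."""
    groups = patient_split([r["patient_id"] for r in records], seed=seed)
    out = {"train": [], "val": [], "test": []}
    for r in records:
        for name, ids in groups.items():
            if r["patient_id"] not in ids:
                continue
            if name == "test":
                # The test set is left unfiltered, so it stays unbiased.
                out[name].append(r)
            else:
                # Keep positives only from scans with the spurious value and
                # negatives only from scans without it (hypothetical rule).
                has_feature = r["field_strength"] == spurious_value
                if has_feature == (r["label"] == 1):
                    out[name].append(r)
    return out


# Synthetic slice-level records: every patient has two negative and two
# positive slices; odd patient ids are assumed to be scanned at 3T.
records = [
    {"patient_id": p, "field_strength": "3T" if p % 2 else "1.5T", "label": label}
    for p in range(40)
    for label in (0, 0, 1, 1)
]

splits = biased_split(records)
for name, rows in splits.items():
    print(name, Counter((r["field_strength"], r["label"]) for r in rows))
```

In this sketch a model trained on the biased split could reach high validation accuracy by predicting the label from field strength alone, since validation shares the same correlation, whereas the unfiltered test set would expose the shortcut.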