SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification

Jong Bum Won; Wesley De Neve; Joris Vankerschaver; Utku Ozbulak

TL;DR

SpurBreast is a curated breast MRI dataset built to study spurious correlations in classification. It deliberately embeds two biases: magnetic field strength and vertical image orientation. Experiments with two architectures (ResNet-50 and ViT-B/16) on controlled data splits show that DNNs can exploit these non-clinical cues to reach high validation performance yet fail to generalize to unbiased test data. The release offers both biased and unbiased benchmarks, and the dataset and code are publicly available to support work on uncertainty estimation and bias mitigation in medical imaging models.

Abstract

Deep neural networks (DNNs) have demonstrated remarkable success in medical imaging, yet their real-world deployment remains challenging due to spurious correlations, where models can learn non-clinical features instead of meaningful medical patterns. Existing medical imaging datasets are not designed to systematically study this issue, largely due to restrictive licensing and limited supplementary patient data. To address this gap, we introduce SpurBreast, a curated breast MRI dataset that intentionally incorporates spurious correlations to evaluate their impact on model performance. Analyzing over 100 features involving patient, device, and imaging protocol, we identify two dominant spurious signals: magnetic field strength (a global feature influencing the entire image) and image orientation (a local feature affecting spatial alignment). Through controlled dataset splits, we demonstrate that DNNs can exploit these non-clinical signals, achieving high validation accuracy while failing to generalize to unbiased test data. Alongside these two datasets containing spurious correlations, we also provide benchmark datasets without spurious correlations, allowing researchers to systematically investigate clinically relevant and irrelevant features, uncertainty estimation, adversarial robustness, and generalization strategies. Models and datasets are available at https://github.com/utkuozbulak/spurbreast.
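To make the idea of a controlled split concrete, the following is a minimal sketch of how a biased training/validation pool and an unbiased test pool could be drawn from the same records. It is an illustration under stated assumptions, not the authors' actual protocol: the field names `label` and `spur` (a binary stand-in for a spurious feature such as field strength), the function names, and the 0.9 alignment rate are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_spurious_feature(records, n_per_class, align=0.9, seed=0):
    """Draw a biased pool and an unbiased pool from `records`.

    Each record is a dict with a binary 'label' and a binary 'spur'
    (hypothetical names). In the biased pool, a fraction `align` of each
    class has spur == label; in the unbiased pool the two are balanced.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for r in records:
        cells[(r["label"], r["spur"])].append(r)
    for cell in cells.values():
        rng.shuffle(cell)

    n_aligned = round(n_per_class * align)
    n_conflict = n_per_class - n_aligned
    n_half = n_per_class // 2

    biased, unbiased = [], []
    for label in (0, 1):
        aligned = cells[(label, label)]       # spurious feature agrees with label
        conflict = cells[(label, 1 - label)]  # spurious feature disagrees
        if len(aligned) < n_aligned + n_half or len(conflict) < n_conflict + n_half:
            raise ValueError("not enough records to build both pools")
        biased += aligned[:n_aligned] + conflict[:n_conflict]
        # The unbiased pool uses only records left over from the biased pool.
        unbiased += aligned[n_aligned:n_aligned + n_half]
        unbiased += conflict[n_conflict:n_conflict + n_half]
    return biased, unbiased


def agreement(records):
    """Fraction of records whose spurious feature equals the label."""
    return sum(r["label"] == r["spur"] for r in records) / len(records)
```

A model trained and validated on the biased pool can score well by reading `spur` alone, whereas on the unbiased pool that shortcut is no better than chance; the difference between the two scores is the generalization gap the abstract describes.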
