Exploring Patient Data Requirements in Training Effective AI Models for MRI-based Breast Cancer Classification
Solha Kang, Wesley De Neve, Francois Rameau, Utku Ozbulak
TL;DR
The paper addresses how much patient data is needed to train effective MRI-based breast cancer classifiers. It leverages Vision Transformer–based foundation models (ViT-B/16) with transfer learning from self-supervised (DINO, MAE) and supervised pretraining on ImageNet, applied to the Duke Breast Cancer Dataset. Key findings show that performance competitive with the state of the art can be achieved with as few as 10–50 patients, with diminishing returns beyond 50 patients. The choice of pretraining method has limited impact, while simple ensemble methods provide additional gains. These results suggest that medical institutions can train and deploy competitive, data-efficient AI models in-house for breast cancer MRI classification, though broader validation across diverse datasets is warranted.
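As an illustration of what a simple ensemble can look like, the following minimal sketch averages the class probabilities of several independently fine-tuned classifiers (soft voting). The summary above does not specify which ensembling rule the paper uses, so soft voting is an assumption here, and the function names `softmax` and `ensemble_predict` are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ensemble_predict(logits_per_model):
    """Average class probabilities across models (soft voting).

    logits_per_model: list of arrays, each of shape (n_samples, n_classes),
    one array per model (e.g. one per pretraining scheme). This is an assumed
    ensembling rule, not necessarily the one used in the paper.
    """
    probs = np.stack([softmax(np.asarray(l, dtype=float)) for l in logits_per_model])
    mean_probs = probs.mean(axis=0)           # shape: (n_samples, n_classes)
    return mean_probs, mean_probs.argmax(axis=-1)


# Toy logits for two samples from three hypothetical models.
logits_a = np.array([[2.0, 0.0], [0.0, 1.0]])
logits_b = np.array([[1.0, 0.0], [0.0, 3.0]])
logits_c = np.array([[-1.0, 0.0], [0.5, 0.0]])

mean_probs, labels = ensemble_predict([logits_a, logits_b, logits_c])
```

In this toy example the third model disagrees with the other two on both samples, and the averaged probabilities follow the majority. Soft voting adds no trainable parameters, which is one sense in which an ensemble can improve results without additional model complexity.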
Abstract
The past decade has witnessed a substantial increase in the number of startups and companies offering AI-based solutions for clinical decision support in medical institutions. However, the critical nature of medical decision-making raises several concerns about relying on external software. Key issues include potential variations in image modalities and in the medical devices used to obtain these images, potential legal issues, and adversarial attacks. Fortunately, the open-source nature of machine learning research has made foundation models publicly available and straightforward to use for medical applications. This accessibility allows medical institutions to train their own AI-based models, thereby mitigating the aforementioned concerns. Given this context, an important question arises: how much data do medical institutions need to train effective AI models? In this study, we explore this question in relation to breast cancer detection, a particularly contested area due to the prevalence of this disease, which affects approximately 1 in every 8 women. Through large-scale experiments on training sets containing varying numbers of patients, we show that medical institutions do not need a decade's worth of MRI images to train an AI model that performs competitively with the state of the art, provided the model leverages foundation models. Furthermore, we observe that beyond 50 patients, the number of patients in the training set has a negligible impact on model performance, and that simple ensembles further improve the results without additional complexity.
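Varying the number of patients in the training set implies sampling at the patient level rather than the image level, so that every image of a selected patient is included. The sketch below shows one way to do this. The abstract does not describe the sampling protocol, so the function name `sample_patient_subset`, the patient identifiers, and the seeding scheme are all hypothetical.

```python
import random
from collections import defaultdict


def sample_patient_subset(image_patient_ids, n_patients, seed=0):
    """Return indices of all images belonging to n_patients randomly chosen patients.

    image_patient_ids: one patient identifier per image, in dataset order.
    The selection is made over patients, not images, so a patient's images
    are either all included or all excluded.
    """
    by_patient = defaultdict(list)
    for idx, pid in enumerate(image_patient_ids):
        by_patient[pid].append(idx)

    if n_patients > len(by_patient):
        raise ValueError("requested more patients than are available")

    rng = random.Random(seed)                      # fixed seed for repeatable subsets
    chosen = rng.sample(sorted(by_patient), n_patients)
    return sorted(i for pid in chosen for i in by_patient[pid])


# Hypothetical toy dataset: six images from three patients.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
subset_indices = sample_patient_subset(patient_ids, n_patients=2, seed=0)
```

Repeating the call with different seeds and values of `n_patients` (for example 10, 50, and larger counts) would give the kind of training subsets needed to study how performance changes with the number of patients.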
