Exploring Patient Data Requirements in Training Effective AI Models for MRI-based Breast Cancer Classification

Solha Kang, Wesley De Neve, Francois Rameau, Utku Ozbulak

TL;DR

The paper addresses how much patient data is needed to train effective MRI-based breast cancer detectors. It leverages Vision Transformer–based foundation models (ViT-B/16) with transfer learning from self-supervised (DINO, MAE) and supervised pretraining on ImageNet, applied to the Duke Breast Cancer Dataset. Key findings show that state-of-the-art performance can be achieved with as few as 10–50 patients, with diminishing returns thereafter, and that pretraining choice has limited impact while simple ensemble methods provide additional gains. These results suggest that medical institutions can deploy competitive, data-efficient AI in-house for breast cancer MRI classification, though broader validation across diverse datasets is warranted.

Abstract

The past decade has witnessed a substantial increase in the number of startups and companies offering AI-based solutions for clinical decision support in medical institutions. However, the critical nature of medical decision-making raises several concerns about relying on external software. Key issues include potential variations in image modalities and the medical devices used to obtain these images, potential legal issues, and adversarial attacks. Fortunately, the open-source nature of machine learning research has made foundation models publicly available and straightforward to use for medical applications. This accessibility allows medical institutions to train their own AI-based models, thereby mitigating the aforementioned concerns. Given this context, an important question arises: how much data do medical institutions need to train effective AI models? In this study, we explore this question in relation to breast cancer detection, a particularly contested area due to the prevalence of this disease, which affects approximately 1 in every 8 women. Through large-scale experiments with varying numbers of patients in the training set, we show that medical institutions do not need a decade's worth of MRI images to train an AI model that performs competitively with the state of the art, provided the model leverages foundation models. Furthermore, we observe that for patient counts greater than 50, the number of patients in the training set has a negligible impact on the performance of models and that simple ensembles further improve the results without additional complexity.

Paper Structure

This paper contains 7 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Visualization of the ViT architecture and image patch tokenization.
  • Figure 2: An example set of images from the Duke breast MRI dataset: (a) Tumor-positive breast MRI images, overlaid with bounding boxes indicating tumor locations, and (b) Tumor-negative breast MRI images.
  • Figure 3: Overview of the generation of training splits. A fixed validation set of 200 patients is randomly sampled from a total of 900 patients. From the remaining 700 patients, 10 training splits of $n \in \{1, 5, 10, 50, 100, 200, 400, 700\}$ patients are randomly sampled for each patient count.
  • Figure 4: (left) Validation accuracies of the best-performing ViT models trained with 1, 5, 10, 50, 100, 200, 400, 700 patients across all 10 training splits, and (right) corresponding F1-scores for the selected models.
  • Figure 5: Box plots illustrating the distribution of (a) accuracy as well as (b) precision on validation sets across ViT models trained with 1, 5, 10, 50, 100, 200, 400, and 700 patients. For each patient count in the training set, patients are randomly sampled from the training dataset to create 10 training splits.
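
The split-generation procedure described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `make_splits`, the integer patient IDs, the fixed seed, and the choice to sample each training split independently and without replacement are all assumptions. Only the numbers (900 patients, 200 for validation, 10 splits per patient count) come from the caption.

```python
import random

# Patient counts listed in the Figure 3 caption.
PATIENT_COUNTS = [1, 5, 10, 50, 100, 200, 400, 700]


def make_splits(patient_ids, n_val=200, n_splits=10,
                counts=PATIENT_COUNTS, seed=0):
    """Hypothetical helper: one fixed validation set plus repeated training splits.

    Assumption: patients are sampled without replacement within a split,
    and each split is drawn independently of the others.
    """
    rng = random.Random(seed)
    ids = list(patient_ids)

    # Fixed validation set, sampled once and shared by every experiment.
    val_ids = rng.sample(ids, n_val)
    val_set = set(val_ids)

    # Remaining patients form the pool that training splits are drawn from.
    pool = [p for p in ids if p not in val_set]

    # For each patient count n, draw n_splits random subsets of size n.
    # Note: when n equals the pool size (700), every split contains
    # the same patients, only in a different order.
    train_splits = {
        n: [rng.sample(pool, n) for _ in range(n_splits)]
        for n in counts
    }
    return val_ids, train_splits


# Example with placeholder patient IDs 0..899.
val_ids, train_splits = make_splits(range(900))
```

Sampling at the patient level, rather than at the image level, keeps all images from one patient on the same side of the train/validation boundary, which is what a fixed patient-level validation set requires.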