Subgroup Performance of a Commercial Digital Breast Tomosynthesis Model for Breast Cancer Detection

Beatrice Brown-Mulry, Rohan Satya Isaac, Sang Hyup Lee, Ambika Seth, KyungJee Min, Theo Dapamede, Frank Li, Aawez Mansuri, MinJae Woo, Christian Allison Fauria-Robinson, Bhavna Paryani, Judy Wawira Gichoya, Hari Trivedi

TL;DR

This study delivers the first thorough subgroup evaluation of a commercial DBT AI model for breast cancer detection across demographic, imaging, and pathology subtypes using the EMBED dataset. The INSIGHT DBT model achieves an overall AUROC of $0.91$ and recall of $0.73$, but shows reduced performance for non-invasive cancers (AUROC $\approx 0.85$), calcifications, and dense breast tissue, highlighting modality and pathology-specific limitations. The results demonstrate robust performance across many subgroups yet reveal important weaknesses that demand cautious interpretation and continuous, subgroup-aware validation before clinical deployment. These findings emphasize that AI tools in DBT should augment radiologists with awareness of when and where performance may drop, guiding safer adoption in screening programs.

Abstract

While research has established the potential of AI models for mammography to improve breast cancer screening outcomes, no detailed subgroup evaluations have been performed to assess the strengths and weaknesses of commercial models for digital breast tomosynthesis (DBT) imaging. This study presents a granular evaluation of the Lunit INSIGHT DBT model on a large retrospective cohort of 163,449 screening mammography exams from the Emory Breast Imaging Dataset (EMBED). Model performance was evaluated in a binary context, with various negative exam types (162,081 exams) compared against screen-detected cancers (1,368 exams) as the positive class. The analysis was stratified across demographic, imaging, and pathologic subgroups to identify potential disparities. The model achieved an overall AUC of 0.91 (95% CI: 0.90-0.92), with a precision of 0.08 (95% CI: 0.08-0.08) and a recall of 0.73 (95% CI: 0.71-0.76). Performance was found to be robust across demographics, but cases with non-invasive cancers (AUC: 0.85, 95% CI: 0.83-0.87), calcifications (AUC: 0.80, 95% CI: 0.78-0.82), and dense breast tissue (AUC: 0.90, 95% CI: 0.88-0.91) were associated with significantly lower performance compared to other groups. These results highlight the need for detailed evaluation of model characteristics and vigilance in considering adoption of new tools for clinical deployment.
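As a rough illustration of how metrics of this kind can be computed, the sketch below derives AUROC, precision, and recall from per-exam scores and binary labels, with a percentile bootstrap for the 95% interval. It is not the study's code: the bootstrap procedure, the `>=` comparison, and all function names are assumptions made for this example, and the 0.1 default threshold simply mirrors the operating point quoted in the figure captions.

```python
import random

def auroc(labels, scores):
    """Rank-based (Mann-Whitney) AUROC; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative exam")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision_recall(labels, scores, threshold=0.1):
    """Precision and recall when exams scoring >= threshold are flagged (assumed rule)."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def bootstrap_ci(metric, labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(labels, scores) (assumed CI method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(metric(ys, [scores[i] for i in idx]))
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi

# Toy data only; these are not values from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.05, 0.20, 0.15, 0.90]
toy_auc = auroc(toy_labels, toy_scores)
toy_precision, toy_recall = precision_recall(toy_labels, toy_scores)
toy_lo, toy_hi = bootstrap_ci(auroc, toy_labels, toy_scores)
```

A subgroup analysis along these lines would apply the same functions to the exams within each demographic, imaging, or pathology stratum. With very low cancer prevalence, precision stays small even when AUROC is high, which is why the two metrics are worth reporting side by side.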

Paper Structure

This paper contains 28 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Class assignment flowchart. Only the screen-detected cancer label was considered in the positive class; all other labels were considered negative, with the exception of interval cancers, which were considered separately. Abnormal screening exams were divided into confirmed cancers, confirmed benign, and diagnostic negatives. Negative screening exams with at least one follow-up within 1-4 years were considered in the negative class. Exams that did not meet the criteria for any label were excluded from further analysis.
  • Figure 2: Distribution of model predicted scores across outcome labels: screen negative, diagnostic negative, biopsy-proven benign, screen-detected cancer, and interval cancer. Model prediction scores range from 0 to 1, with a model operating point of 0.1. The chart illustrates the proportion of cases classified as positive or negative based on the set threshold, with shaded areas representing the distribution density.
  • Figure 3: Distribution of model predicted scores for pathology subtypes. Each category represents the most severe pathology identified for a given exam. Model prediction scores range from 0 to 1, with a model operating point of 0.1. The chart illustrates the proportion of cases classified as positive or negative based on the set threshold, with shaded areas representing the distribution density.
  • Figure 4: Distribution of model predicted scores for various invasive cancer subtypes. The model operating point of 0.1 is marked by a horizontal line, with the percentage of exams above and below the threshold indicated per pathology. IDC has many subtypes with varying imaging features and prognosis, which are shown in the adjacent panel. Model performance for invasive cancers is generally good; certain subtypes show a lower proportion of positive cases, although the small sample sizes preclude any conclusions about subtypes.
  • Figure 5: Distribution of model predicted scores for various non-invasive cancer subtypes. The model operating point of 0.1 is marked by a horizontal line, with the percentage of exams above and below the threshold indicated per pathology. DCIS grades and subtypes are shown in the adjacent panel. Model performance was significantly lower for all non-invasive cancers than for invasive cancers.
  • ...and 9 more figures
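
The per-category percentages above and below the operating point that these captions describe can be tallied with a few lines of Python. This is a minimal sketch under stated assumptions: the function name, the `>=` comparison, and the toy scores are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def flagged_fraction_by_group(groups, scores, threshold=0.1):
    """Fraction of exams per group scoring at or above the threshold (assumed rule)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for group, score in zip(groups, scores):
        counts[group][1] += 1
        if score >= threshold:
            counts[group][0] += 1
    return {g: flagged / total for g, (flagged, total) in counts.items()}

# Toy data only; group names reuse abbreviations from the captions, scores are invented.
toy_groups = ["IDC", "IDC", "IDC", "DCIS", "DCIS"]
toy_group_scores = [0.40, 0.80, 0.05, 0.02, 0.30]
toy_fractions = flagged_fraction_by_group(toy_groups, toy_group_scores)
```

For a group made up entirely of cancers, this fraction is that group's recall at the chosen operating point.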