Detecting and Monitoring Bias for Subgroups in Breast Cancer Detection AI

Amit Kumar Kundu, Florence X. Doo, Vaishnavi Patil, Amitabh Varshney, Joseph Jaja

TL;DR

This work tackles bias and distribution shift in breast cancer detection AI by evaluating high-performing models on the EMBED and RSNA datasets across six subgroups defined by age, race, density, view, scanner, and site. It develops an EMBED-based model with a ConvNeXt backbone and advanced training techniques, and compares performance to a top RSNA challenge model, reporting comprehensive subgroup metrics. A CUSUM-based statistical process control framework is introduced to detect performance drifts over time, enabling timely interventions such as retraining or recalibration. The findings reveal subgroup-specific disparities (e.g., age- and race-related sensitivity and PPV differences) and demonstrate that continuous monitoring can maintain fairness and accuracy in real-world deployment, guiding safer, more equitable clinical use of mammography AI.

Abstract

Automated mammography screening plays an important role in early breast cancer detection. However, current machine learning models, each developed on specific training datasets, may exhibit performance degradation and bias when deployed in real-world settings. In this paper, we analyze the performance of high-performing AI models on two mammography datasets: the Emory Breast Imaging Dataset (EMBED) and the RSNA 2022 challenge dataset. Specifically, we evaluate how these models perform across different subgroups, defined by six attributes, to detect potential biases using a range of classification metrics. Our analysis identifies certain subgroups that demonstrate notable underperformance, highlighting the need for ongoing monitoring of these subgroups' performance. To address this, we adopt a monitoring method designed to detect performance drifts over time. Upon identifying a drift, this method issues an alert, which can enable timely interventions. This approach not only provides a tool for tracking performance but also helps ensure that AI models continue to perform effectively across diverse populations.
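The subgroup evaluation described in the abstract can be illustrated with a minimal sketch. It assumes binary labels and thresholded binary predictions, and it computes only sensitivity and PPV per subgroup. The function name `subgroup_metrics` and the example group labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def subgroup_metrics(labels, preds, groups):
    """Compute sensitivity and PPV for each subgroup.

    labels, preds: sequences of 0/1 (ground truth, thresholded prediction).
    groups: sequence of subgroup identifiers, one per sample.
    A metric is None when its denominator is zero.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for y, p, g in zip(labels, preds, groups):
        if y == 1 and p == 1:
            key = "tp"
        elif y == 0 and p == 1:
            key = "fp"
        elif y == 1 and p == 0:
            key = "fn"
        else:
            key = "tn"
        counts[g][key] += 1

    results = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]   # actual cancers in this subgroup
        flagged = c["tp"] + c["fp"]     # cases the model called positive
        results[g] = {
            "sensitivity": c["tp"] / positives if positives else None,
            "ppv": c["tp"] / flagged if flagged else None,
            "n": sum(c.values()),
        }
    return results


# Toy data (fabricated for illustration only, not from EMBED or RSNA).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_preds = [1, 0, 0, 1, 1, 0]
toy_groups = ["<50", "<50", "<50", "80+", "80+", "80+"]
toy_result = subgroup_metrics(toy_labels, toy_preds, toy_groups)
```

Comparing the per-subgroup values against the overall metric is one simple way to flag subgroups that may warrant closer monitoring. Small subgroups give noisy estimates, so the sample count `n` is returned alongside each metric.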

Paper Structure

This paper contains 28 sections, 1 equation, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of the proposed framework. The framework assesses breast cancer detection (BCD) performance across subgroups to understand the impacts of different attributes, enabling pre-deployment bias identification and post-deployment drift monitoring for alerts.
  • Figure 2: Model performance for different attributes. PPV and sensitivity are shown on the left axis, while AUROC is shown on the right axis. Error bars represent the standard deviation of metrics.
  • Figure 3: CUSUM-based sensitivity monitoring charts. The batches are assumed to form a sequential data stream. After time index 100, more samples from the underperforming (a) NHPI and (b) 'age $\ge 80$' groups of EMBED are added to each batch to introduce performance drift.
  • Figure 4: Monitoring sensitivity drops under distribution shifts of subgroups, induced by changing the proportion of (a) age groups from EMBED, (b) racial groups from EMBED, and (c) age groups from RSNA. Here, $k$ is set to 0.
  • Figure 5: Sample mammograms for different BIRADS scores
  • ...and 1 more figures
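The CUSUM charts in Figures 3 and 4 track per-batch sensitivity over time. A minimal sketch of a one-sided (downward) CUSUM detector is given here in its textbook form; the paper's exact statistic may differ. The function name `cusum_lower`, the in-control `target`, and the alarm threshold `h` are assumptions for illustration. The allowance `k` corresponds to the $k$ mentioned in the Figure 4 caption.

```python
def cusum_lower(values, target, k=0.0, h=0.5):
    """One-sided CUSUM for detecting a downward drift in a metric.

    values: per-batch metric values in time order (e.g. sensitivity).
    target: in-control reference level (assumed known or estimated
            from a stable baseline period).
    k:      allowance; shortfalls from the target smaller than k are ignored.
    h:      alarm threshold on the cumulative statistic (hypothetical value).

    Returns (stats, alarm_index), where alarm_index is the first time
    index at which the statistic exceeds h, or None if it never does.
    """
    s = 0.0
    stats = []
    alarm_index = None
    for t, x in enumerate(values):
        # Accumulate shortfall below the target; reset to zero when the
        # metric recovers enough to cancel the accumulated deficit.
        s = max(0.0, s + (target - x) - k)
        stats.append(s)
        if alarm_index is None and s > h:
            alarm_index = t
    return stats, alarm_index


# Toy sequence (fabricated): stable sensitivity, then a sustained drop.
toy_sensitivity = [0.8] * 5 + [0.6] * 5
toy_stats, toy_alarm = cusum_lower(toy_sensitivity, target=0.8, k=0.0, h=0.5)
```

With `k = 0`, every shortfall below the target accumulates, which makes the detector sensitive to small sustained drops at the cost of more false alarms on noisy batches. A larger `k` or `h` trades detection delay for fewer alerts.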