Detecting and Monitoring Bias for Subgroups in Breast Cancer Detection AI
Amit Kumar Kundu, Florence X. Doo, Vaishnavi Patil, Amitabh Varshney, Joseph Jaja
TL;DR
This work tackles bias and distribution shift in breast cancer detection AI by evaluating high-performing models on the EMBED and RSNA datasets across six subgroups defined by age, race, density, view, scanner, and site. It develops an EMBED-based model with a ConvNeXt backbone and advanced training techniques, and compares performance to a top RSNA challenge model, reporting comprehensive subgroup metrics. A CUSUM-based statistical process control framework is introduced to detect performance drifts over time, enabling timely interventions such as retraining or recalibration. The findings reveal subgroup-specific disparities (e.g., age- and race-related sensitivity and PPV differences) and demonstrate that continuous monitoring can maintain fairness and accuracy in real-world deployment, guiding safer, more equitable clinical use of mammography AI.
Abstract
Automated mammography screening plays an important role in early breast cancer detection. However, current machine learning models, each developed on a particular training dataset, may exhibit performance degradation and bias when deployed in real-world settings. In this paper, we analyze the performance of high-performing AI models on two mammography datasets: the Emory Breast Imaging Dataset (EMBED) and the RSNA 2022 challenge dataset. Specifically, we evaluate how these models perform across different subgroups, defined by six attributes, to detect potential biases using a range of classification metrics. Our analysis identifies certain subgroups that demonstrate notable underperformance, highlighting the need for ongoing monitoring of these subgroups' performance. To address this, we adopt a monitoring method designed to detect performance drifts over time. Upon identifying a drift, this method issues an alert, which can enable timely interventions. This approach not only provides a tool for tracking performance but also helps ensure that AI models continue to perform effectively across diverse populations.
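To make the drift-alert idea concrete, the following is a minimal sketch of a one-sided CUSUM check on a stream of per-batch metric values (for example, the sensitivity of one subgroup). It is an illustration under stated assumptions, not the formulation used in this work: the function name `cusum_lower` and the parameters `target`, `slack`, and `threshold` are hypothetical, and the numeric values are made up for the example.

```python
from typing import Optional, Sequence


def cusum_lower(
    values: Sequence[float],
    target: float,
    slack: float,
    threshold: float,
) -> Optional[int]:
    """Return the index of the first alarm, or None if no drift is flagged.

    One-sided (downward) CUSUM: accumulate how far each observation falls
    below the target, minus a slack allowance, and never let the sum go
    below zero. All parameter names are hypothetical choices for this sketch.
    """
    s = 0.0
    for i, x in enumerate(values):
        # Shortfall relative to the target, reduced by the slack term.
        s = max(0.0, s + (target - x) - slack)
        if s > threshold:
            return i  # alert: sustained drop in the monitored metric
    return None


# Illustrative (made-up) per-batch sensitivities for one subgroup.
stable = [0.90, 0.91, 0.89, 0.90]
drifting = stable + [0.80, 0.79, 0.81, 0.80]

no_alarm = cusum_lower(stable, target=0.90, slack=0.02, threshold=0.15)
alarm_at = cusum_lower(drifting, target=0.90, slack=0.02, threshold=0.15)
```

In this sketch the slack term absorbs small batch-to-batch fluctuations, so only a sustained drop accumulates enough to cross the threshold. In practice both values would need to be tuned per subgroup to balance false alarms against detection delay.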
