MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization

Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed, Mohamed Mabrok

TL;DR

This work tackles the problem of dataset heterogeneity in mammography, which undermines AI generalization. It introduces MammoClean, a public, modular pipeline for harmonizing imaging data and metadata while enabling systematic bias quantification across multi-view mammography datasets. By applying it to CBIS-DDSM, TOMPEI-CMMD, and VinDr-Mammo, the authors demonstrate how standardization reduces inconsistencies, reveals cross-dataset biases (e.g., in breast density and BI-RADS distribution), and supports fairer cross-domain model development. The paper emphasizes bias-aware evaluation and clinically aligned decision-making, and it calls for richer longitudinal and multimodal datasets to support robust, equitable AI for breast cancer screening.

Abstract

The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise model generalizability, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection and image processing (including laterality and intensity correction) and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to models trained on the curated versions of the same datasets. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides a reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
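
To make the laterality and intensity correction mentioned in the abstract concrete, the following is a minimal sketch of what such a step could look like. It is not taken from the MammoClean repository: the function names (`normalize_intensity`, `orient_left`, `harmonize`) are hypothetical, and both the min-max scaling and the brighter-half orientation heuristic are assumptions chosen for illustration.

```python
import numpy as np


def normalize_intensity(img, invert=False):
    """Min-max scale pixel values to [0, 1].

    `invert=True` flips the grey scale, which is one way to handle images
    stored with inverted polarity (an assumption, not MammoClean's rule).
    """
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # Constant image: nothing to scale, return zeros of the same shape.
        return np.zeros_like(img)
    scaled = (img - lo) / (hi - lo)
    return 1.0 - scaled if invert else scaled


def orient_left(img):
    """Flip horizontally so the brighter half sits on the left.

    Heuristic assumption: breast tissue is brighter than background,
    so the half with the larger summed intensity contains the breast.
    """
    mid = img.shape[1] // 2
    left_sum = img[:, :mid].sum()
    right_sum = img[:, mid:].sum()
    return img if left_sum >= right_sum else np.fliplr(img)


def harmonize(img, invert=False):
    """Apply intensity normalization, then orientation correction."""
    return orient_left(normalize_intensity(img, invert=invert))


if __name__ == "__main__":
    # Synthetic 4x6 "image" with tissue-like bright pixels on the right.
    demo = np.zeros((4, 6))
    demo[:, 4:] = 200.0
    out = harmonize(demo)
    print(out[0])  # first row after scaling and flipping
```

Normalizing before orienting keeps the orientation heuristic independent of each dataset's raw pixel range. A real pipeline would more likely read laterality and photometric interpretation from the image metadata than infer them from pixel sums.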

Paper Structure

This paper contains 18 sections, 1 equation, 12 figures, 3 tables.

Figures (12)

  • Figure 1: The role of mammography imaging in breast cancer screening and diagnosis.
  • Figure 2: Illustration of breast anatomy and its appearance across different mammographic views. Images are from nguyen2023vindr.
  • Figure 3: Causal relationships between various factors and their impact on mammography images for both disease-related and non-disease-related features. Patient gender, ethnicity, and age are common factors influencing breast density, which directly affects non-disease visual appearances in images and, by altering breast cancer risk and detection difficulty, also impacts disease-related features.
  • Figure 4: Different mass types (images are from song2009breast): (1) round with circumscribed margins (BI-RADS 2, benign), (2) oval with circumscribed margins (BI-RADS 3, benign), (3) oval with ill-defined margins (BI-RADS 4, malignant), and (4) irregular shape with spiculated margins (BI-RADS 5, malignant).
  • Figure 5: Different calcification types (images are from cui2021chinese and kashiwada2025tompei): (1) small round calcifications with scattered distribution throughout the whole breast (BI-RADS 2, benign), (2) pleomorphic calcifications with segmental distribution in the medial breast (BI-RADS 5, malignant), and (3) amorphous, indistinct calcifications with grouped distribution in the medial breast (BI-RADS 3, malignant).
  • ...and 7 more figures