Toward explainable AI approaches for breast imaging: adapting foundation models to diverse populations

Guilherme J. Cavalcante, José Gabriel A. Moreira, Gabriel A. B. do Nascimento, Vincent Dong, Alex Nguyen, Thaís G. do Rêgo, Yuri Malheiros, Telmo M. Silva Filho, Carla R. Zeballos Torrez, James C. Gee, Anne Marie McCarthy, Andrew D. A. Maidment, Bruno Barufaldi

TL;DR

The paper addresses automated BI-RADS density classification in mammography by adapting a foundation model (BiomedCLIP) to handle multi-modality data (synthesized 2D, digital mammography, and digital breast tomosynthesis). It employs image-text contrastive learning with weighted losses and stratified group cross-validation, achieving comparable accuracy to single-modality training while delivering strong external validation performance (AUC 0.80–0.93) and Grad-CAM-based interpretability. The results demonstrate cross-modality generalization and explainability, highlighting the potential for foundation-model–driven breast imaging tasks beyond density estimation. This work lays groundwork for expanding to diagnostic tasks such as lesion detection and image retrieval while maintaining clinical interpretability.

Abstract

Foundation models hold promise for specialized medical imaging tasks, though their effectiveness in breast imaging remains underexplored. This study leverages BiomedCLIP as a foundation model to address challenges in model generalization. BiomedCLIP was adapted for automated BI-RADS breast density classification using multi-modality mammographic data (synthesized 2D images, digital mammography, and digital breast tomosynthesis). Using 96,995 images, we compared single-modality (s2D only) and multi-modality training approaches, addressing class imbalance through weighted contrastive learning. Both approaches achieved similar accuracy (multi-modality: 0.74, single-modality: 0.73). The multi-modality model offered broader applicability across imaging modalities and higher AUC values, consistently above 0.84 across BI-RADS categories. External validation on the RSNA and EMBED datasets showed strong generalization capabilities (AUC range: 0.80–0.93). Grad-CAM visualizations confirmed consistent and clinically relevant attention patterns, highlighting the model's interpretability and robustness. This research underscores the potential of foundation models for breast imaging applications, paving the way for future extensions to diagnostic tasks.
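
The abstract does not spell out how the weighted contrastive loss is formed, so the following is only a minimal sketch of one common way to do it. It assumes one text-prompt embedding per BI-RADS category, inverse-frequency class weights, and a softmax over image-to-text cosine similarities. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n / (num_classes * count), so rare classes count more.
    This weighting scheme is an assumption, not the paper's stated method."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_contrastive_loss(image_embs, text_embs, labels, weights, temperature=0.07):
    """Class-weighted cross-entropy over image-to-text similarity logits.

    image_embs: one embedding per image
    text_embs:  one embedding per class prompt (e.g. BI-RADS A-D)
    labels:     index of the correct class prompt for each image
    weights:    per-class weights, e.g. from inverse_frequency_weights
    """
    total, weight_sum = 0.0, 0.0
    for emb, y in zip(image_embs, labels):
        logits = [cosine(emb, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        nll = log_z - logits[y]  # negative log-likelihood of the correct prompt
        total += weights[y] * nll
        weight_sum += weights[y]
    return total / weight_sum

# Toy data: two classes, class 1 is under-represented.
labels = [0, 0, 0, 1]
class_weights = inverse_frequency_weights(labels, num_classes=2)
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2], [0.1, 1.0]]
misaligned = [[0.1, 1.0], [0.0, 0.9], [0.2, 1.0], [1.0, 0.1]]
loss_aligned = weighted_contrastive_loss(aligned, text_embs, labels, class_weights)
loss_misaligned = weighted_contrastive_loss(misaligned, text_embs, labels, class_weights)
```

In this toy setup the single minority-class image carries three times the weight of each majority-class image, which is the intended effect of weighting under class imbalance. A real pipeline would compute the embeddings with the BiomedCLIP image and text encoders and would use a tensor library rather than plain lists.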

Paper Structure

This paper contains 6 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Grad-CAM visualizations across BI-RADS categories A–D (left) and confusion matrix on the validation set (right), using data from a single site (s2D only). Note: Warmer colors indicate higher model attention.
  • Figure 2: Grad-CAM visualizations and confusion matrices for two external datasets. Left panel (I): two RSNA examples, each showing the original image (left) and the corresponding Grad-CAM overlay (right), along with the confusion matrix. Right panel (II): two EMBED examples with the same layout, and the associated confusion matrix shown on the right.
  • Figure 3: Grad-CAM visualizations demonstrating model robustness to imaging variations (paddles, annotations, implants) while maintaining focus on breast tissue density patterns.