Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography

Shantanu Ghosh; Clare B. Poynton; Shyam Visweswaran; Kayhan Batmanghelich

TL;DR

Mammo-CLIP addresses data scarcity in mammography by pretraining a vision-language model on mammogram–report pairs, jointly learning image and text embeddings with a cross-modal contrastive objective $\mathcal{L}$. It enhances data efficiency through multi-view and instance/dataset augmentation and introduces Mammo-FActOR to provide attribute-aligned, spatial interpretability via heatmaps mapped to the image encoder channels. Evaluated on VinDr and RSNA, Mammo-CLIP demonstrates strong classification and localization performance, with zero-shot and low-data regimes approaching or surpassing baselines and Mammo-FActOR enabling weakly supervised localization without bounding boxes. The work advances practical breast cancer screening by enabling robust, data-efficient learning with interpretable, attribute-grounded reasoning in mammography.
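The cross-modal contrastive objective mentioned above can be illustrated with a generic CLIP-style symmetric loss. This is a minimal sketch under assumptions, not the paper's exact formulation: the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical, and the paper's own $\mathcal{L}$ may include additional multi-view terms not shown here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Generic symmetric contrastive loss over a batch of paired embeddings.

    Assumption: image_embs[k] and text_embs[k] come from the same
    mammogram-report pair; every other pairing in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each image should pick out its own report.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each report should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy 2-D embeddings (hypothetical values, for illustration only).
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                 [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting, correctly paired embeddings give a loss near zero and swapped pairs give a large loss. That gradient signal is what pulls matching image and text embeddings together during pretraining.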

Abstract

The lack of large and diverse training data for Computer-Aided Diagnosis (CAD) in breast cancer detection has been one of the concerns impeding the adoption of such systems. Recently, pre-training on large-scale image-text datasets via Vision-Language Models (VLMs) (e.g., CLIP) has partially addressed the issues of robustness and data efficiency in computer vision (CV). This paper proposes Mammo-CLIP, the first VLM pre-trained on a substantial amount of screening mammogram-report pairs, addressing the challenges of dataset diversity and size. Our experiments on two public datasets demonstrate strong performance in classifying and localizing various mammographic attributes crucial for breast cancer detection, showcasing data efficiency and robustness similar to CLIP in CV. We also propose Mammo-FActOR, a novel feature attribution method, to provide spatial interpretation of the representation with sentence-level granularity within mammography reports. Code is publicly available: https://github.com/batmanlab/Mammo-CLIP.

Paper Structure (13 sections, 3 equations, 4 figures, 5 tables)

Introduction
Method
Mammo-CLIP
Instance and Dataset Augmentation
Mammo-FActOR
Experiments
Datasets
Experimental details
Results
Conclusion
Zero-shot prompts for density
Prompts to synthesize report-like sentences from attributes in image-mammographic attribute dataset
Visualization of activation maps identified by Mammo-FActOR

Figures (4)

Figure 1: Schematic view of our method. (a) Image-text augmentation for MVS. (b) Dataset augmentation by synthesizing reports using image-attribute datasets. (c) Mammo-CLIP pretraining strategy. (d) Feature attribution using Mammo-FActOR.
Figure 1: Example of report-like sentence generation for the attribute mass, labeled positively in the VinDr dataset, using the subtypes of mass and the position, laterality, and depth of the breast. We include all such prompts in our codebase in detail.
Figure 2: Mammo-FActOR localizes mass and calcification without the ground-truth bounding boxes from the VinDr dataset. For each pair, the left image shows the original image with the ground-truth bounding box, while the right one shows the bounding box predicted by Mammo-FActOR.
Figure 2: Visualization of activation maps identified by Mammo-FActOR to localize mass and calcification attributes via the report. For each row, the left image is the original image. The middle three show the top-3 activation maps (left to right) with their respective unit numbers. The right one shows the summation of the top-3 feature maps. Ground-truth bounding boxes are drawn as red rectangles. Notably, the optimal feature map identified by Mammo-FActOR (feature map unit 1305 for mass and 1150 for calcification) successfully localizes the attributes.
