Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting

Shantanu Ghosh, Vedant Parthesh Joshi, Rayan Syed, Aya Kassem, Abhishek Varshney, Payel Basak, Weicheng Dai, Judy Wawira Gichoya, Hari M. Trivedi, Imon Banerjee, Shyam Visweswaran, Clare B. Poynton, Kayhan Batmanghelich

TL;DR

Mammo-FM, the first foundation model built specifically for mammography, is pretrained on the largest and most diverse dataset to date and consistently outperforms state-of-the-art generalist FMs across multiple public and private benchmarks.

Abstract

Breast cancer is one of the leading causes of death among women worldwide. We introduce Mammo-FM, the first foundation model specifically for mammography, pretrained on the largest and most diverse dataset to date: 140,677 patients (821,326 mammograms) across four U.S. institutions. Mammo-FM provides a unified foundation for core clinical tasks in breast imaging, including cancer diagnosis, pathology localization, structured report generation, and cancer risk prognosis within a single framework. Its alignment between images and text enables both visual and textual interpretability, improving transparency and clinical auditability, which are essential for real-world adoption. We rigorously evaluate Mammo-FM across diagnosis, prognosis, and report-generation tasks on in-distribution and out-of-distribution datasets. Despite operating on native-resolution mammograms and using only one-third of the parameters of state-of-the-art generalist FMs, Mammo-FM consistently outperforms them across multiple public and private benchmarks. These results highlight the efficiency and value of domain-specific foundation models designed around the full spectrum of tasks within a clinical domain and emphasize the importance of rigorous, domain-aligned evaluation.

Paper Structure

This paper contains 10 sections, 7 equations, 20 figures, 8 tables.

Figures (20)

  • Figure 1: Overview of the Mammo-FM foundation model and downstream applications. a. Schematic of the Mammo-FM framework. High-resolution mammographic views (CC, MLO) and paired radiology reports are jointly used for multi-view contrastive pretraining, aligning image and text representations within a shared embedding space (see Methods). b. Composition of multi-institutional pretraining datasets from the Mayo Clinic, BU, UPMC, and EMBED cohorts. c. Composition of evaluation datasets used to benchmark diagnostic and prognostic generalization across in-distribution (ID; EMBED) and out-of-distribution (OOD; RSNA, VinDr) settings. d. Pathology localization by integrating the Mammo-FM vision encoder into a RetinaNet detector. e. Zero-shot classification with Mammo-FM. Image and text encoders map mammograms and prompts into a shared space. Similarity between embeddings enables classification without fine-tuning. f. Findings classification under either linear-probe or full fine-tuning settings, where a linear classifier is attached to the Mammo-FM image encoder, quantifying the quality of the learned visual representations. g. Interpretable risk prediction built using frozen Mammo-FM encoders across CC and MLO views for 1–5 year risk estimation. h. Automated report generation (Mammo-GRG) combining Mammo-FM with a Llama-3.1-8B language model to produce clinically grounded radiology reports.
  • Figure 2: Evaluation of Mammo-FM representations for diagnosing mammographic findings. a. Schematic overview of zero-shot classification, where aligned image and text encoders predict the presence of mammographic findings (e.g., mass, calcification) by measuring image–text embedding similarity, without task-specific training. b. Zero-shot evaluation of breast cancer and finding classification performance on in-distribution (ID; EMBED) and out-of-distribution (OOD; VinDr, RSNA) datasets. c. Illustration of the linear probing setup, in which only a classification head is trained atop the frozen vision encoder to assess representation quality. d. Schematic of the full fine-tuning configuration, where the entire model is optimized end-to-end for downstream diagnostic tasks. e. Data efficiency analysis across ID and OOD datasets, showing progressive improvement in classification performance with increasing fractions of training data. f. Pathology localization using a RetinaNet detection framework with a Mammo-FM vision encoder as backbone. g. Pathology localization performance for detecting mass and calcification on the VinDr dataset. h–i. Quantitative results for OOD and ID classification under full fine-tuning (top panels) and linear probing (bottom panels), reported as AUROC. Across all tasks and datasets, Mammo-FM (multi-institution) shows higher diagnostic accuracy, superior generalization, and data efficiency compared with single-institution, generalist, and self-supervised baselines, highlighting the importance of domain-specific, multi-institutional pre-training for clinically robust mammographic representation learning. In all the evaluations, we denote suspicious calcification as calcification for brevity.
  • Figure 3: Integrating Mammo-FM representations with risk prediction pipelines and interpretable modeling. a. Training pipeline of the risk predictors (MIRAI w/ Mammo-FM and AsymMIRAI w/ Mammo-FM) using knowledge distillation from 1–5-year risk outputs of the original MIRAI model. We freeze the image encoder throughout the training. b. We replace the standard ResNet-18 encoder in MIRAI and AsymMIRAI with a Mammo-FM-trained encoder that aligns images and reports, enabling richer and more transferable representations. c. The Matryoshka Sparse Autoencoder (SAE) trains on 2048-dimensional spatial Mammo-FM features (48×29 patch grid) and expands them into a 16,384-dimensional sparse latent space, creating interpretable neuron-level representations. d. Risk predictors use SAE features for interpretability: turning neurons on and off quantifies their effect on risk. Shapley analysis ranks neurons by their contribution, linking top activations to both spatial image regions and key radiology report sentences. e. Comparison of AUROC for cancer prediction among MIRAI w/ Mammo-FM, AsymMIRAI w/ Mammo-FM, and their baselines on the VinDr (OOD) and RSNA (OOD) datasets. Each curve reports the mean AUROC with 95% confidence intervals (CIs). f. Comparison of AUROC for 1–5-year risk prediction among MIRAI w/ Mammo-FM, AsymMIRAI w/ Mammo-FM, and their baselines on the BU (ID) dataset, showing that Mammo-FM-based encoders improve performance while maintaining interpretability. Each curve reports the mean AUROC with 95% confidence intervals (CIs). g. Correlation between MIRAI and MIRAI w/ Mammo-FM risks demonstrates consistent risks across 5-year time horizons on the BU (ID) dataset. Correlation coefficients (Pearson’s $r$) are reported with 95% CIs. h. SAE-reconstructed features retain discriminative fidelity in classification and detection relative to original Mammo-FM features. i. Qualitative visualization of MIRAI w/ Mammo-FM risk interpretation for a high-risk 1-year case (MIRAI risk = 0.74) in the BU dataset shows dual interpretability: image-based saliency highlights a neuron selective for mass features, while the corresponding report sentence yields the highest similarity drop for risk, indicating language-based interpretability.
  • Figure 4: Overview of Mammo-GRG. a. Schematic of the Mammo-GRG architecture. Four-view screening mammograms (LCC, LMLO, RCC, RMLO) are encoded by dedicated Mammo-FM vision encoders and projected to the latent space of a Llama-3.1-8B language model via a multimodal projector. View-specific tokens and positional embeddings preserve spatial and semantic distinctions, enabling cross-view reasoning. b. Clinical grounding stage where Mammo-GRG generates the final radiology report from the preliminary report. Mammo-GRG leverages Mammo-FM in zero-shot mode to predict key findings (mass, calcification, asymmetry) that constrain report generation. c. The GREEN evaluation framework evaluates the factual and clinical correctness of the generated report by comparing candidate and target reports using an LLM judge (GPT-4o-mini). d. Structured findings are extracted from both generated and reference reports using GPT-4o-mini. e–f. Lexical (BLEU-1, ROUGE-L) and clinical (GREEN) performance on BU and UPMC datasets demonstrate consistent improvements of Mammo-GRG over fine-tuned MedGemma and LLaVA-Med baselines. g. Recall of key mammographic findings extracted from generated versus reference reports across datasets and laterality, showing higher clinical recall for Mammo-GRG. h. Pathological grounding evaluation, showing recall for mass and calcification findings extracted from generated reports compared directly to image-level ground-truth labels on the in-distribution (EMBED) and out-of-distribution (VinDr) datasets. Mammo-GRG maintains strong predictive alignment with actual imaging findings. i. Qualitative example illustrating the GREEN score assessment of a Mammo-GRG-generated report against ground truth, highlighting accurate recovery of findings and clinical consistency.
  • Figure 5: Distributions of radiology report lengths across datasets. a. Violin plots showing the distribution of report word counts for breast imaging datasets from BU and UPMC. Each violin displays the full distribution of report lengths, with internal boxes denoting the interquartile range and median. Reports from UPMC were generally longer and exhibited higher variance compared with BU. b. Histogram and kernel density estimates (KDEs) comparing report length frequency between BU and UPMC. Dashed vertical lines indicate mean word counts for each dataset (BU mean = 32 words; UPMC mean = 47 words). The distributions highlight systematic differences in reporting style and verbosity across clinical sites.
  • ...and 15 more figures
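
The zero-shot classification described in the captions of Figures 1e and 2a (scoring a mammogram against text prompts by embedding similarity in a shared space) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax over prompts, the temperature value of 0.07, and the toy 3-dimensional embeddings are all assumptions made for illustration. In practice the embeddings would come from the aligned image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_probs(image_emb, prompt_embs, temperature=0.07):
    """Score one image embedding against k prompt embeddings.

    image_emb:   shape (d,)   hypothetical image-encoder output
    prompt_embs: shape (k, d) hypothetical text-encoder outputs, one per prompt
    temperature: assumed softmax temperature, not taken from the paper
    Returns a length-k probability vector over the prompts.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(prompt_embs)
    sims = txt @ img                  # cosine similarity per prompt, shape (k,)
    logits = sims / temperature
    logits = logits - logits.max()    # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy stand-ins for encoder outputs (illustrative values only).
prompt_embs = np.array([
    [1.0, 0.0, 0.0],   # e.g. a prompt describing "no mass"
    [0.0, 1.0, 0.0],   # e.g. a prompt describing "mass"
])
image_emb = np.array([0.2, 0.9, 0.1])

probs = zero_shot_probs(image_emb, prompt_embs)
predicted = int(np.argmax(probs))     # index of the best-matching prompt
```

No classifier weights are trained here: the class decision comes entirely from which prompt embedding lies closest to the image embedding, which is why the quality of the image-text alignment determines zero-shot performance.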