MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction

Ruggiero Santeramo, Igor Zubarev, Florian Jug

TL;DR

MamaDino addresses the challenge of accurate 3-year breast cancer risk prediction while using lower-resolution mammograms. It combines a frozen DINOv3 vision transformer with a trainable SE-ResNeXt backbone and a BilateralMixer to fuse bilateral views, enabling explicit contralateral reasoning. On OPTIMAM UK data, MamaDino matches or surpasses Mirai while using about 13× fewer input pixels and improves further with the BilateralMixer to an internal AUC of $0.736$ and external AUC of $0.677$. The results suggest that thoughtful architectural priors and bilateral context can close the gap to high-resolution CNNs, with potential to streamline risk-based screening.

Abstract

Breast cancer screening programmes increasingly seek to move from one-size-fits-all intervals to risk-adapted, personalised strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (>1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would allow us to match state-of-the-art 3-year risk prediction performance even when operating on substantially lower-resolution mammograms, indicating that using less detailed images in a more structured way can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512×512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At the breast level, MamaDino matches Mirai on both internal and external tests while using ~13× fewer input pixels. Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases enable predictions that match Mirai, despite operating on substantially lower-resolution mammograms.
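
The "~13× fewer input pixels" figure can be sanity-checked with a short calculation. The sketch below is illustrative only: the 512×512 input size comes from the abstract, whereas the 1664×2048 input size used for Mirai is an assumption that this text does not state (it only says ">1M pixels"). The names `MAMADINO_HW`, `MIRAI_HW`, `pixel_count` and `pixel_ratio` are hypothetical.

```python
# Per-view input sizes as (height, width).
MAMADINO_HW = (512, 512)    # stated in the abstract
MIRAI_HW = (1664, 2048)     # ASSUMPTION: not given in this text


def pixel_count(hw):
    """Number of pixels in a single image of size (height, width)."""
    height, width = hw
    return height * width


def pixel_ratio(high_res, low_res):
    """How many times more pixels the high-resolution input contains."""
    return pixel_count(high_res) / pixel_count(low_res)


if __name__ == "__main__":
    print("MamaDino pixels per view:", pixel_count(MAMADINO_HW))
    print("Assumed Mirai pixels per view:", pixel_count(MIRAI_HW))
    print("Ratio:", pixel_ratio(MIRAI_HW, MAMADINO_HW))
```

Under that assumed Mirai resolution, the ratio is consistent with the roughly 13-fold reduction reported in the abstract.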

Paper Structure (23 sections, 2 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: MamaDino overview: four standard mammography views are processed through a hybrid CNN-DINOv3 backbone with per-channel augmentation and bilateral mixing to produce a 3-year breast cancer risk score, achieving higher AUC than Mirai while operating at substantially lower image resolution.
  • Figure 2: MamaDino architecture: (a) Four standard mammography views per exam (at $512\times512$ resolution) are processed by a hybrid fusion encoder that combines a frozen DINOv3 for global semantics with a trainable SE-ResNeXt101 CNN for local texture, producing per-breast embeddings that are fused for 3-year malignancy risk prediction in Stage 2 via the BilateralMixer. (b) The BridgeMixer block aligns Transformer tokens with convolutional feature maps via spatial cross-attention and 1×1-convolution fusion. (c) The BilateralMixer block takes left and right breast embeddings, models bilateral concordance and asymmetry through a transformer, and outputs the final risk score.
  • Figure A1: Per-channel augmentation vs. simple replication ablation. Validation AUC of the breast-level MamaDino encoder on a held-out OPTIMAM validation set as a function of input resolution (224, 320, 512). Curves show mean AUC across random initialisations (2 seeds per setup), with shaded regions indicating the range across runs. The blue line (“3-Channel Repeats”) corresponds to the standard practice of replicating the grayscale image into three identical channels before augmentation. The orange line (“Per-Channel Augmentation”) corresponds to our proposed strategy, where each channel is independently perturbed (brightness/contrast jitter and CLAHE) before being recombined. Across all resolutions, per-channel augmentation yields consistently higher AUC, with the largest gain at 512×512.
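
The contrast between the two input strategies in Figure A1 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`jitter`, `per_channel_augment`, `replicate`), the jitter ranges, and the assumption that intensities are scaled to [0, 1] are all hypothetical, and the CLAHE step mentioned in the caption is omitted to keep the example dependency-light.

```python
import numpy as np


def jitter(img, rng, brightness=0.1, contrast=0.2):
    """Apply one random brightness/contrast perturbation to a [0, 1] image.

    The jitter ranges are illustrative assumptions, not values from the paper.
    """
    scale = 1.0 + rng.uniform(-contrast, contrast)   # contrast factor
    shift = rng.uniform(-brightness, brightness)     # brightness offset
    mean = img.mean()
    out = (img - mean) * scale + mean + shift
    return np.clip(out, 0.0, 1.0)


def replicate(gray):
    """Baseline: copy the grayscale image into three identical channels."""
    return np.repeat(gray[None, ...], 3, axis=0)


def per_channel_augment(gray, rng):
    """Perturb three copies independently, then stack them as channels."""
    return np.stack([jitter(gray, rng) for _ in range(3)], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gray = rng.random((512, 512))      # stand-in for a normalised mammogram
    baseline = replicate(gray)         # shape (3, 512, 512), identical channels
    augmented = per_channel_augment(gray, rng)  # channels differ from each other
    print(baseline.shape, augmented.shape)
```

With replication, a pretrained three-channel backbone sees the same signal three times; with independent perturbation, each channel carries a slightly different rendering of the same anatomy, which is the property the ablation in Figure A1 evaluates.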