MamaDino: A Hybrid Vision Model for Breast Cancer 3-Year Risk Prediction
Ruggiero Santeramo, Igor Zubarev, Florian Jug
TL;DR
MamaDino addresses the challenge of accurate 3-year breast cancer risk prediction from lower-resolution mammograms. It combines a frozen DINOv3 vision transformer with a trainable SE-ResNeXt backbone and a BilateralMixer that fuses bilateral views, enabling explicit contralateral reasoning. On OPTIMAM UK data, MamaDino matches or surpasses Mirai while using about 13× fewer input pixels, and the BilateralMixer improves it further to an internal AUC of 0.736 and an external AUC of 0.677. The results suggest that thoughtful architectural priors and bilateral context can close the gap to high-resolution CNNs, with potential to streamline risk-based screening.
Abstract
Breast cancer screening programmes increasingly seek to move from one-size-fits-all intervals to risk-adapted, personalised strategies. Deep learning (DL) has enabled image-based risk models with stronger 1- to 5-year prediction than traditional clinical models, but leading systems (e.g., Mirai) typically use convolutional backbones, very high-resolution inputs (>1M pixels) and simple multi-view fusion, with limited explicit modelling of contralateral asymmetry. We hypothesised that combining complementary inductive biases (convolutional and transformer-based) with explicit contralateral asymmetry modelling would allow us to match state-of-the-art 3-year risk prediction performance even when operating on substantially lower-resolution mammograms; in other words, that using less detailed images in a more structured way can recover state-of-the-art accuracy. We present MamaDino, a mammography-aware multi-view attentional DINO model. MamaDino fuses frozen self-supervised DINOv3 ViT-S features with a trainable CNN encoder at 512×512 resolution, and aggregates bilateral breast information via a BilateralMixer to output a 3-year breast cancer risk score. We train on 53,883 women from OPTIMAM (UK) and evaluate on matched 3-year case-control cohorts: an in-distribution test set from four screening sites and an external out-of-distribution cohort from an unseen site. At the breast level, MamaDino matches Mirai on both internal and external tests while using ~13× fewer input pixels. Adding the BilateralMixer improves discrimination to AUC 0.736 (vs 0.713) in-distribution and 0.677 (vs 0.666) out-of-distribution, with consistent performance across age, ethnicity, scanner, tumour type and grade. These findings demonstrate that explicit contralateral modelling and complementary inductive biases enable predictions that match Mirai, despite operating on substantially lower-resolution mammograms.
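The ~13× pixel reduction can be checked with a short calculation. The sketch below is illustrative only: the 512×512 input size is taken from the abstract, whereas the 1664×2048 input size for Mirai is an assumption based on the size commonly reported for that model and is not stated in this text. The helper name `pixel_count` is hypothetical.

```python
# Per-image pixel budget comparison (illustrative sketch).
MAMADINO_SIZE = (512, 512)   # input resolution stated in the abstract
MIRAI_SIZE = (1664, 2048)    # ASSUMPTION: commonly reported Mirai input size, not given here


def pixel_count(size):
    """Return the number of pixels in an image of the given (height, width)."""
    height, width = size
    return height * width


mamadino_pixels = pixel_count(MAMADINO_SIZE)
mirai_pixels = pixel_count(MIRAI_SIZE)
reduction = mirai_pixels / mamadino_pixels

print(f"MamaDino pixels per image: {mamadino_pixels:,}")
print(f"Mirai pixels per image:    {mirai_pixels:,}")
print(f"Reduction factor:          {reduction:.1f}x")
```

Under that assumption, each 512×512 image carries 262,144 pixels against 3,407,872 for the higher-resolution input, a factor of 13, which is consistent with the figure quoted above.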
