MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction

Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba

TL;DR

MV-MLM addresses data scarcity in mammography by pairing high-resolution multi-view images with synthetic radiology reports generated from tabular metadata and training on both in a multi-view vision-language contrastive framework. The model learns cross-modal representations via image-text and multi-view contrastive losses, combined in an MV-CLIP objective that improves malignancy, mass, and calcification classification as well as breast cancer risk prediction, using synthetic text only and no real radiology reports. It achieves state-of-the-art performance on public datasets (VinDr-Mammo and RSNA-Mammo) and demonstrates strong data efficiency and generalization, including linear-probe transfer from synthetic text. The approach reduces reliance on manually annotated reports and has potential clinical impact by enabling scalable, accurate breast cancer screening and risk assessment.
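As a rough illustration of how a synthetic report might be produced from tabular metadata, the sketch below fills a fixed sentence template from one metadata record. The field names (`view`, `laterality`, `density`, `findings`, `birads`) and the template wording are hypothetical and chosen only for this example; the summary does not specify the authors' actual fields or templates.

```python
def synthesize_report(record):
    """Build a short pseudo-report from one metadata row.

    All keys used here are assumed for illustration, not taken from the paper.
    """
    parts = [f"{record['view']} view of the {record['laterality']} breast."]
    if "density" in record:
        parts.append(f"Breast density category {record['density']}.")
    findings = record.get("findings", [])
    if findings:
        parts.append("Findings: " + ", ".join(findings) + ".")
    else:
        parts.append("No suspicious findings.")
    if "birads" in record:
        parts.append(f"BI-RADS assessment {record['birads']}.")
    return " ".join(parts)


example = {
    "view": "CC",
    "laterality": "left",
    "density": "C",
    "findings": ["mass", "suspicious calcification"],
    "birads": 4,
}
print(synthesize_report(example))
```

A template like this keeps the text deterministic and tied to the labels already present in the table, which is one plausible reason synthetic reports can stand in for real ones during pre-training.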

Abstract

Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-grained annotations is both costly and time-consuming. Vision-Language Models (VLMs) such as CLIP, which are pre-trained on large collections of image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model (MV-MLM) for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs, where each pair consists of multiple views and the corresponding pseudo-radiology report. We propose a novel joint visual-textual learning strategy that improves generalization and accuracy across different data types and tasks. The model learns to distinguish breast tissues and cancer characteristics (calcification, mass) and uses these patterns to interpret mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised and VLM baselines while being trained on synthetic text reports, without the need for actual radiology reports.
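The combination of image-text and multi-view contrastive supervision described above can be sketched with a CLIP-style symmetric InfoNCE loss. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the two views are assumed to be CC and MLO, the exam-level image embedding is assumed to be the mean of the two view embeddings, and the names `multi_view_clip_loss`, `view_weight`, and `temperature` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """CLIP-style loss: row i of `a` and row i of `b` form the positive pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def multi_view_clip_loss(cc_emb, mlo_emb, text_emb, view_weight=1.0, temperature=0.07):
    """Image-text term plus multi-view term (the weighting is an assumption)."""
    exam_emb = 0.5 * (cc_emb + mlo_emb)  # assumed aggregation: mean of the two views
    image_text = symmetric_info_nce(exam_emb, text_emb, temperature)
    multi_view = symmetric_info_nce(cc_emb, mlo_emb, temperature)
    return image_text + view_weight * multi_view


rng = np.random.default_rng(0)
cc = rng.normal(size=(8, 16))                 # stand-in CC-view embeddings
mlo = cc + 0.1 * rng.normal(size=(8, 16))     # stand-in MLO-view embeddings
text = cc + 0.1 * rng.normal(size=(8, 16))    # stand-in report embeddings
print(multi_view_clip_loss(cc, mlo, text))
```

In this sketch, correctly paired embeddings give a lower loss than mismatched ones, which is the property a contrastive pre-training objective relies on.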

Paper Structure

This paper contains 13 sections, 4 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Overview of the proposed Multi-View Mammography-Language Model (MV-MLM) for breast cancer screening applications. The model is optimized using the following objective functions: multi-view visual feature alignment and vision-language contrastive learning, with feature tokenization and aggregation. It integrates multi-modal inputs, including multi-view mammography exams and synthetic radiology reports, to improve diagnostic and prediction performance in four tasks relevant to breast cancer screening: mass classification, calcification classification, malignancy classification, and breast cancer risk prediction.