Features Fusion for Dual-View Mammography Mass Detection

Arina Varlamova, Valery Belotsky, Grigory Novikov, Anton Konushin, Evgeny Sidorov

TL;DR

The paper addresses the challenge of using both views in dual-view mammography for mass detection. It introduces MAMM-Net, which fuses CC and MLO features at the pixel level via a Fusion Layer based on deformable attention. The Fusion Layer is combined with a View-Interactive Transformer Decoder, which also outputs malignancy scores, and a Lesion Linker that establishes correspondences between lesions in the two views. Key contributions include the Fusion Layer, feature-level fusion across views, and state-of-the-art results on the DDSM dataset with $R@0.25=81.6$, $R@0.5=87.9$, $R@1.0=90.6$, plus malignancy metrics (ROC-AUC $=85.3$, sensitivity $=80.2$, specificity $=76.2$). The approach reduces false positives while retaining high recall, mirroring radiologists' two-view reasoning and improving its potential clinical utility for computer-aided diagnosis.

Abstract

Detection of malignant lesions on mammography images is extremely important for early breast cancer diagnosis. In clinical practice, images are acquired from two different angles, and radiologists can fully utilize information from both views, simultaneously locating the same lesion. However, for automatic detection approaches, such information fusion remains a challenge. In this paper, we propose a new model called MAMM-Net, which processes both mammography views simultaneously by sharing information not only on an object level, as seen in existing works, but also on a feature level. MAMM-Net's key component is the Fusion Layer, based on deformable attention and designed to increase detection precision while keeping high recall. Our experiments show superior performance on the public DDSM dataset compared to the previous state-of-the-art model, while introducing new helpful features such as pixel-level lesion annotation and classification of lesion malignancy.

Paper Structure (21 sections, 1 equation, 3 figures, 2 tables)

Figures (3)

  • Figure 1: General overview of MAMM-Net. 1) The two views are processed independently by a shared backbone; 2) the resulting feature maps are processed by the Fusion Pixel Decoder, which provides fused feature maps for the masked attention of the View-Interactive Transformer Decoder and high-resolution feature maps of both views for mask generation; 3) the View-Interactive Transformer Decoder (VITD), consisting of blocks of masked, self-, and inter-attention, outputs object queries, masks for both the CC and MLO views, and classifications of the detected objects along with their malignancy scores; 4) the Lesion Linker uses the object queries from the VITD to establish correspondences between objects in the CC and MLO views and outputs triplets of embeddings and a pair classification.
  • Figure 2: Architecture of the Fusion Pixel Decoder (upper) and the Fusion Layer (lower). 1) Fusion Pixel Decoder: the module takes feature maps of different resolutions for both the CC and MLO views. Starting from the lowest resolution, the feature maps are fused into each other by the Fusion Layer and then combined in an FPN manner. The fused low-resolution feature maps are passed to the VITD for use in masked attention, and the last fused feature map is used to generate high-resolution masks; 2) Fusion Layer: the main feature map (Q in the Fusion Pixel Decoder) provides the queries for the multi-head attention module, while keys and values are sampled from the reference feature map (K, V in the Fusion Pixel Decoder). The resulting queries, keys, and values are processed by a multi-head attention block.
  • Figure 3: Example of a prediction (blue) and the ground truth (green). The Intersection over Union of the two objects is 0.15, although the contours clearly indicate the same object.
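
To make the point of Figure 3 concrete, the following minimal sketch computes Intersection over Union for two axis-aligned boxes. The coordinates are invented for illustration and are not taken from the paper; they are chosen so that a prediction lying entirely inside the ground-truth region still scores only 0.15. The function name `box_iou` and the `(x1, y1, x2, y2)` box convention are assumptions of this sketch, and the paper's own evaluation may work on contours or masks rather than boxes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: a large ground-truth region and a small
# prediction that lies completely inside it.
ground_truth = (0, 0, 10, 10)  # area 100
prediction = (2, 2, 7, 5)      # area 15

iou = box_iou(prediction, ground_truth)
print(f"IoU = {iou:.2f}")
```

Here the intersection is the whole prediction (area 15) and the union is the whole ground-truth box (area 100), so the IoU is 15 / 100 = 0.15 even though the prediction marks the correct object. A strict IoU threshold would count such a detection as a miss.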