MV-Swin-T: Mammogram Classification with Multi-view Swin Transformer

Sushmita Sarker, Prithul Sarker, George Bebis, Alireza Tavakkoli

TL;DR

MV-Swin-T addresses the lack of multi-view modelling in mammography by employing a pure transformer-based network that fuses ipsilateral CC and MLO views using a Multi-Head Dynamic Attention (MDA) mechanism within fixed and shifted windows. The Omni-Attention blocks enable both self- and cross-view interactions in local windows, with fusion after stage 2 to balance context and efficiency. Evaluations on CBIS-DDSM and VinDr-Mammo show MV-Swin-T outperforms the single-view Swin-T, particularly on VinDr-Mammo and with 384×384 inputs, demonstrating the viability of fully transformer-based multi-view mammography. The work suggests future directions toward scalability to larger datasets and smoother clinical integration.

Abstract

Traditional deep learning approaches for breast cancer classification have predominantly concentrated on single-view analysis. In clinical practice, however, radiologists concurrently examine all views within a mammography exam, leveraging the inherent correlations in these views to effectively detect tumors. Acknowledging the significance of multi-view analysis, some studies have introduced methods that independently process mammogram views, either through distinct convolutional branches or simple fusion strategies, inadvertently leading to a loss of crucial inter-view correlations. In this paper, we propose an innovative multi-view network exclusively based on transformers to address challenges in mammographic image classification. Our approach introduces a novel shifted window-based dynamic attention block, facilitating the effective integration of multi-view information and promoting the coherent transfer of this information between views at the spatial feature map level. Furthermore, we conduct a comprehensive comparative analysis of the performance and effectiveness of transformer-based models under diverse settings, employing the CBIS-DDSM and VinDr-Mammo datasets. Our code is publicly available at https://github.com/prithuls/MV-Swin-T

Paper Structure

This paper contains 8 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Illustration of self- and cross-attention operations within the proposed Multi-Head Dynamic Attention (MDA) block for ipsilateral views. The term 'matmul' denotes matrix multiplication.
  • Figure 2: (a) Our proposed multi-view architecture. (b) Two successive Omni-Attention Transformer Blocks featuring W-MDA and SW-MDA components, i.e. Multi-Head Dynamic Attention with regular and shifted window configurations, respectively. The diagram shows a single input for simplicity; $Z$ denotes the combined representations of the CC and MLO views, $Z = \langle Z_{CC}, Z_{MLO} \rangle$.
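
The self- and cross-attention within regular and shifted windows described in the captions above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that). It makes several assumptions that are not stated in this summary: a single attention head, projection matrices `w_q`, `w_k`, `w_v` shared by both views, a Swin-style cyclic shift for the shifted windows, no attention mask for the shifted case, and summation as the way self- and cross-view outputs are combined. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, window, shift=0):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns (num_windows, window * window, C). A non-zero `shift`
    applies a cyclic shift first (assumed Swin-style shifted windows).
    """
    if shift:
        x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)

def attention(q, k, v):
    # Scaled dot-product attention over the tokens of each window.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def window_self_and_cross_attention(z_cc, z_mlo, w_q, w_k, w_v):
    """z_cc, z_mlo: (num_windows, tokens, dim) windowed CC and MLO features."""
    q_cc, k_cc, v_cc = z_cc @ w_q, z_cc @ w_k, z_cc @ w_v
    q_mlo, k_mlo, v_mlo = z_mlo @ w_q, z_mlo @ w_k, z_mlo @ w_v

    # Self-attention: each view attends to its own window tokens.
    self_cc = attention(q_cc, k_cc, v_cc)
    self_mlo = attention(q_mlo, k_mlo, v_mlo)

    # Cross-attention: queries from one view, keys/values from the other.
    cross_cc = attention(q_cc, k_mlo, v_mlo)
    cross_mlo = attention(q_mlo, k_cc, v_cc)

    # ASSUMPTION: combine self- and cross-view outputs by summation.
    return self_cc + cross_cc, self_mlo + cross_mlo
```

A regular-window pass would call `window_partition(feat, window=7)` on each view's feature map before the attention step, and a shifted-window pass would add something like `shift=3`. The window size and shift here are illustrative values only.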