MIRAM: Masked Image Autoencoders Across Multiple Scales with Hybrid-Attention Mechanism for Breast Lesion Risk Prediction
Hung Q. Vo, Pengyu Yuan, Zheng Yin, Kelvin K. Wong, Chika F. Ezeana, Son T. Ly, Stephen T. C. Wong, Hien V. Nguyen
TL;DR
This work tackles the challenge of limited annotated medical data for breast lesion analysis by introducing MIRAM, a self-supervised learning (SSL) framework that performs masked image reconstruction across multiple scales. By reconstructing mammograms at both the original and a higher resolution, using multiple decoders with token interpolation, MIRAM learns finer-grained, robust representations that improve pathology classification and mass-margin classification on CBIS-DDSM. The approach outperforms state-of-the-art SSL methods in both AP and AUC, and linear self-attention options for the high-resolution decoder keep it computationally scalable. The study further shows that using lesion-level annotated data during pre-training yields the best downstream performance, highlighting the importance of targeted SSL data curation in medical imaging. Overall, MIRAM offers a practical, scalable SSL pathway for improving breast lesion risk prediction from mammograms.
Abstract
Self-supervised learning (SSL) has garnered substantial interest within the machine learning and computer vision communities. Two prominent SSL approaches are contrastive learning and self-distillation with cropping augmentation. More recently, masked image modeling (MIM) has emerged as a more potent SSL technique that uses image inpainting as a pretext task. MIM creates a strong inductive bias toward meaningful spatial and semantic understanding, which has opened up new opportunities for SSL to contribute not only to classification but also to more complex tasks such as object detection and image segmentation. Building on this progress, this paper introduces a scalable and practical SSL approach centered on more challenging pretext tasks that encourage the learning of robust features. Specifically, we use multi-scale image reconstruction from randomly masked input images as the foundation for feature learning. We hypothesize that reconstructing high-resolution images enables the model to attend to finer spatial details, which is particularly beneficial for discerning subtle structures in medical images. The proposed SSL features improve classification performance on the Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) dataset. In pathology classification, our method achieves a 3% increase in average precision (AP) and a 1% increase in the area under the receiver operating characteristic curve (AUC) compared to state-of-the-art (SOTA) algorithms. In mass-margin classification, it achieves a 4% increase in AP and a 2% increase in AUC.
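To make the pretext task concrete, the following is a minimal NumPy sketch of multi-scale masked reconstruction as the abstract describes it: patches are masked at random, and the reconstruction loss is evaluated on the masked patches at both the original and a higher resolution. It is an illustration under stated assumptions, not the authors' implementation. The patch size (16), mask ratio (0.75), scale factor (2), equal loss weighting, and nearest-neighbour upsampling as a stand-in for a true high-resolution target are all hypothetical choices, and the zero-valued predictions stand in for the outputs of the encoder and decoders, which are not modeled here.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches, shape (N, p*p)."""
    h, w = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    gh, gw = h // p, w // p
    return img.reshape(gh, p, gw, p).transpose(0, 2, 1, 3).reshape(gh * gw, p * p)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling (assumed stand-in for a real high-res image)."""
    return np.kron(img, np.ones((factor, factor), dtype=img.dtype))

def masked_mse(pred, target, mask):
    """Mean squared error averaged over the masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()

rng = np.random.default_rng(0)
img = rng.random((64, 64)).astype(np.float32)  # toy stand-in for a mammogram crop
p, scale, ratio = 16, 2, 0.75                  # hypothetical hyperparameters

mask = random_mask((64 // p) ** 2, ratio, rng)

# Both scales share one patch grid: the high-res patch covers p * scale pixels per side.
target_lo = patchify(img, p)
target_hi = patchify(upsample_nearest(img, scale), p * scale)

# Placeholder predictions; in practice each scale would have its own decoder output.
pred_lo = np.zeros_like(target_lo)
pred_hi = np.zeros_like(target_hi)

# Equal weighting of the two scales is an assumption of this sketch.
loss = masked_mse(pred_lo, target_lo, mask) + masked_mse(pred_hi, target_hi, mask)
```

Keeping the patch grid identical across scales means one mask can index the targets at both resolutions, so the high-resolution branch only has to predict more pixels per token rather than more tokens.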
