ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features

Xin Wei; Yaling Tao; Changde Du; Gangming Zhao; Yizhou Yu; Jinpeng Li

TL;DR

This work proposes ViKL, an innovative framework that synergizes Visual, Knowledge, and Linguistic features, substantially enhancing pathological classification and fostering multimodal interactions in mammography.

Abstract

Mammography is the primary imaging tool for breast cancer diagnosis. Despite significant strides in applying deep learning to interpret mammography images, efforts that focus predominantly on visual features often struggle with generalization across datasets. We hypothesize that integrating additional modalities from radiological practice, notably the linguistic features of reports and manifestation features embodying radiological insights, offers a more powerful, interpretable and generalizable representation. In this paper, we introduce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports. Based on this dataset, we focus on the challenging task of unsupervised pretraining and propose ViKL, an innovative framework that synergizes Visual, Knowledge, and Linguistic features. This framework relies solely on pairing information and does not require pathology labels, which are often challenging to acquire. ViKL employs a triple contrastive learning approach to merge linguistic and knowledge-based insights with visual data, enabling both inter-modality and intra-modality feature enhancement. Our research yields four main findings: 1) By integrating reports and manifestations into unsupervised visual pretraining, ViKL substantially enhances pathological classification and fosters multimodal interactions. 2) Manifestations can introduce a novel hard negative sample selection mechanism. 3) The multimodal features demonstrate transferability across different datasets. 4) The multimodal pretraining approach reduces miscalibration and yields a high-quality representation space. The MVKL dataset and ViKL code are publicly available at https://github.com/wxwxwwxxx/ViKL to support a broad spectrum of future research.
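The triple contrastive idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE loss applied to each of the three modality pairs with equal weights, and the function names, the temperature value, and the equal weighting are all hypothetical choices made for illustration. The paper's actual objectives are defined in its "Pretraining Objectives" section.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, targets, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should match targets[i];
    every other row of `targets` acts as a negative."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

def symmetric_info_nce(a, b, temperature=0.1):
    """Average the loss over both matching directions (a to b, b to a)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def triple_contrastive_loss(visual, knowledge, linguistic, temperature=0.1):
    """Sum of pairwise losses over the three modality pairs.
    Equal weighting of the pairs is an assumption of this sketch."""
    v = [l2_normalize(x) for x in visual]
    k = [l2_normalize(x) for x in knowledge]
    t = [l2_normalize(x) for x in linguistic]
    return (symmetric_info_nce(v, k, temperature)
            + symmetric_info_nce(v, t, temperature)
            + symmetric_info_nce(k, t, temperature))

# Toy batch of two instances with 2-dimensional embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = triple_contrastive_loss(aligned, aligned, aligned)
loss_mismatched = triple_contrastive_loss(aligned, swapped, aligned)
```

In this toy batch, `loss_matched` is smaller than `loss_mismatched`, because swapping the knowledge embeddings breaks the pairing that the loss rewards. Only pairing information enters the computation, which is consistent with the abstract's point that no pathology labels are needed.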

Paper Structure (27 sections, 8 equations, 9 figures, 9 tables)

Introduction
Related Works
Dataset
Method
Model Architecture
Pretraining Objectives
Final Objective and Implementations
Experiments and Results
Evaluation Protocols
Experimental Setups
Public Datasets for External Evaluation
A Glance on the MVKL Dataset
Challenge in Cross-Dataset Generalization
Downstream Tasks Validation
Comparison with Supervised Models
...and 12 more sections

Figures (9)

Figure 1: Knowledge-guided diagnostic protocol of radiologists for breast cancer. They identify suspected lesions on mammography images and refer to the consensus of radiology and their experience to analyze the lesion from the perspective of manifestations, e.g., shape, margin and density. Based on these impressions, radiologists write concise and conclusive mammography reports. This process inspires the visual pretraining paradigm of ViKL, which aggregates visual, knowledge and linguistic features to build intelligent machines for mammography analysis.
Figure 2: The training (A-C) and test (D) phases of ViKL. A. Mammography views are projected into a 128-dimensional hypersphere embedding space, normalized with the $\ell_2$ norm. Features from different views are attracted to each other, facilitating intra-modality alignment. B. Data from the three modalities are projected into the same embedding space using their respective encoders. Matched modalities of the same instance are attracted to each other, ensuring inter-modality alignment. C. Features from distinct instances are repelled to enhance uniformity across the hypersphere, preserving information effectively. D. The image encoder is capable of improved pathological classification of breast lumps through a simple task head. The ViKL model is also versatile, suitable for multimodal tasks like image-report retrieval and manifestation estimation.
Figure 3: The ROC curve for ViKL, along with the metrics from two radiologists. When diagnosing with single-view mammograms, their classification performance is comparable, but the probabilistic outputs from ViKL provide it with greater flexibility.
Figure 4: Confusion matrix of image-report retrieval experiment.
Figure 5: ViKL shows superior adaptability compared to the supervised IM model when fine-tuning on the MVKL dataset with reduced training data, showing a smaller performance drop as the availability of downstream data decreases.
...and 4 more figures
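
Because the captions describe images and reports as sharing one embedding space, image-report retrieval (Figures 2D and 4) can be sketched as a nearest-neighbor search by cosine similarity. The following is an illustrative sketch only: the function names and toy 3-dimensional vectors are hypothetical, and the paper's actual retrieval protocol may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_reports(image_embedding, report_embeddings, top_k=1):
    """Return the indices of the top_k reports, most similar first."""
    scored = [(cosine_similarity(image_embedding, r), idx)
              for idx, r in enumerate(report_embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]

# Toy embeddings standing in for encoder outputs (real ones would be 128-d).
image = [0.9, 0.1, 0.0]
reports = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
ranking = retrieve_reports(image, reports, top_k=2)
```

Here `ranking` places report 1 first and report 0 second, since the image vector points mostly along the first axis.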
