MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

Lei Li; Tianfang Zhang; Xinglin Zhang; Jiaqi Liu; Bingqi Ma; Yan Luo; Tao Chen

TL;DR

MedFLIP tackles data scarcity and high computational demands in medical image analysis by marrying Masked Autoencoders with language supervision in a fast cross-domain pretraining pipeline. It introduces a Medical-SVD loss, defined on the top singular value $\sigma_1$ of the image-text similarity matrix $S$, to enforce structural preservation in medical imagery, and scales masking to boost efficiency while keeping the text stream intact. Empirically, MedFLIP yields superior zero-shot and image-text retrieval performance on CheXpert-5x200 and related benchmarks compared with MedCLIP, ConVIRT, and GLoRIA, with faster pretraining. The approach advances practical medical diagnostics under data constraints and provides a theoretical generalization bound under i.i.d. assumptions, highlighting its potential for rapid, accurate multimodal analysis in healthcare.

Abstract

Within the domain of medical analysis, extensive research has explored the potential of mutual learning between Masked Autoencoders (MAEs) and multimodal data. However, the impact of MAEs on inter-modal learning remains a key challenge. We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for cross-domain zero-shot learning, which enhances the model's ability to learn from limited data, a common scenario in medical diagnostics. We verify that masking an image does not affect inter-modal learning. Furthermore, we propose the SVD loss to enhance representation learning for the characteristics of medical images, aiming to improve classification accuracy by leveraging the structural intricacies of such data. Our theory posits that masking encourages semantic preservation, robust feature extraction, regularization, domain adaptation, and invariance learning. Lastly, we validate that using language improves zero-shot performance in medical image analysis. MedFLIP's scaling of the masking process marks an advancement in the field, offering a pathway to rapid and precise medical image analysis without the traditional computational bottlenecks. Through experiments and validation, MedFLIP demonstrates efficient performance improvements and supports future research and application in medical diagnostics.
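To make the masking step concrete, here is a minimal sketch of MAE-style random patch masking, in which only a random subset of image patches is passed to the image encoder while the text is left untouched. The function name `random_patch_mask`, the 16x16 patch size, and the 50% mask ratio are illustrative assumptions, not values taken from the paper.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (kept, masked) sets, MAE-style.

    Hypothetical helper for illustration; the paper's actual masking
    code and mask ratio are not given in this summary.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    kept = sorted(order[:num_keep])    # patches the encoder would see
    masked = sorted(order[num_keep:])  # patches dropped from the input
    return kept, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
kept, masked = random_patch_mask(num_patches, mask_ratio=0.5, seed=0)
```

Because the encoder only processes the kept patches, its per-image cost shrinks roughly in proportion to the mask ratio, which is the source of the speed-up that masked pre-training aims for.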

Paper Structure

This paper contains 21 sections, 2 theorems, 13 equations, 3 figures, and 3 tables.

Introduction
Related Work
Masked Autoencoders (MAEs).
Multimodal learning.
Zero-shot learning.
Methods
Overview
Masked Autoencoder
Fusion Module
Theoretical Analysis
Experiments
Datasets and Experiment Details
Classification
Implementation for Classification.
Image-Text Retrieval.
...and 6 more sections

Key Result

Lemma 1

Let $\hat{y}_{ij}$ be the predicted normalized similarity between the $i$-th image embedding and $j$-th text embedding in a batch (Eq. 5 in the paper). Let $y^*_{ij}$ be the optimal normalized similarity that minimizes the MedFLIP loss $L_{MedFLIP}$ (Eq. 7). Then with probability at least $1-\delta$ where $\tau$ is the temperature hyperparameter and $\beta$ is the SVD loss weight.
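The quantities named in the lemma can be illustrated with a small sketch. Eq. 5 and the exact form of the SVD loss are not reproduced in this summary, so the code below assumes a CLIP-style construction: cosine similarities between image and text embeddings, a row-wise softmax with temperature `tau` as the "normalized similarity", and power iteration to estimate the top singular value $\sigma_1$ of the similarity matrix. All function names are hypothetical.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(img_emb, txt_emb):
    """Cosine similarity S[i][j] between image i and text j."""
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    return [[sum(a * b for a, b in zip(i, t)) for t in txt] for i in img]


def row_softmax(S, tau):
    """Assumed normalization: softmax over each row of S / tau."""
    out = []
    for row in S:
        m = max(row)  # subtract the row max for numerical stability
        e = [math.exp((s - m) / tau) for s in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out


def top_singular_value(S, iters=200):
    """Estimate sigma_1 of S by power iteration on S^T S.

    Assumes the uniform start vector is not orthogonal to the top
    right-singular vector.
    """
    rows, cols = len(S), len(S[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(S[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # S v
        w = [sum(S[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # S^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(S[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))  # ||S v|| with ||v|| = 1


# Toy batch of two perfectly aligned image-text pairs.
S = similarity_matrix([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
y_hat = row_softmax(S, tau=0.07)
sigma_1 = top_singular_value(S)
```

A smaller `tau` sharpens each row of `y_hat` toward the best-matching text. How $\sigma_1$ enters the loss and how it is weighted by $\beta$ is specified in the paper's own equations, not here.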

Figures (3)

Figure 1: Our proposed method, MedFLIP, demonstrates a superior trade-off between training efficiency and accuracy compared to the MedCLIP method. Notably, MedFLIP achieves higher accuracy on the CheXpert-5x200 (irvin2019chexpert) validation set under zero-shot evaluation, while maintaining a consistent model size.
Figure 2: The MedFLIP workflow involves a Knowledge Extraction module responsible for discerning medical entities within the original medical report. This process entails the extraction of pertinent medical entities from the text. Subsequently, a masking operation is executed on the associated image using a randomly generated mask. Following this, a semantic similarity matrix is constructed by juxtaposing the extracted medical entities derived from the text with the original labels delineated in the image.
Figure 3: In zero-shot learning, our proposed method, MedFLIP, outperforms established approaches, namely MedCLIP, ConVIRT (zhang2022contrastive), and GLoRIA (huang2021gloria). This is particularly evident when pretraining data is limited. While ConVIRT and GLoRIA leverage the MIMIC-CXR (369K) and CheXpert (191K) datasets, respectively, MedFLIP still achieves better performance.

Theorems & Definitions (4)

Lemma 1
proof
Theorem 1: MedFLIP Generalization Bound
proof
