MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

Qiuhui Chen; Xinyue Hu; Zirui Wang; Yi Hong

TL;DR

The paper tackles multimodal computer-aided diagnosis (CAD) in neurology by fusing 3D brain MRI with text from electronic health records (EHR). It introduces MedBLIP, a bootstrapped vision-language pre-training (VLP) model that uses a MedQFormer module to connect 3D medical images to a frozen 2D vision encoder and a frozen large language model. Training is parameter-efficient: the language model is either kept frozen or fine-tuned with LoRA, and image-text contrastive (ITC) losses drive the alignment. The model is trained on more than 30,000 MRI volumes from five public Alzheimer's disease (AD) datasets and achieves state-of-the-art zero-shot classification of healthy controls, mild cognitive impairment (MCI), and AD subjects, while also demonstrating zero-shot medical visual question answering (VQA). The work offers a practical, low-compute path to multimodal medical reasoning and suggests avenues for adding modalities and longitudinal analyses.
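The ITC-based alignment mentioned above can be illustrated with a minimal sketch of a symmetric image-text contrastive loss in its common CLIP-style form. This is not the authors' implementation: the paper's exact formulation is not reproduced on this page, and the function names (`l2_normalize`, `cross_entropy_rows`, `itc_loss`) and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(logits, targets):
    """Mean cross-entropy where row i's correct column is targets[i]."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(logits)

def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i (image i, text i) is the positive.

    temperature=0.07 is an assumed illustrative value, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    targets = list(range(len(imgs)))
    return 0.5 * (cross_entropy_rows(sim, targets)
                  + cross_entropy_rows(sim_t, targets))
```

With toy embeddings, matched pairs such as `itc_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])` yield a lower loss than the same embeddings with the texts swapped, which is the behavior a contrastive alignment objective is meant to reward.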

Abstract

Vision-language pre-training (VLP) models have been demonstrated to be effective in many computer vision applications. In this paper, we consider developing a VLP model in the medical domain for making computer-aided diagnoses (CAD) based on image scans and text descriptions in electronic health records, as done in practice. To achieve our goal, we present a lightweight CAD system, MedBLIP, a new paradigm for bootstrapping VLP from off-the-shelf frozen pre-trained image encoders and frozen large language models. We design a MedQFormer module to bridge the gap between 3D medical images and both 2D pre-trained image encoders and language models. To evaluate the effectiveness of MedBLIP, we collect more than 30,000 image volumes from five public Alzheimer's disease (AD) datasets, i.e., ADNI, NACC, OASIS, AIBL, and MIRIAD. On this dataset, the largest AD dataset we know of, our model achieves SOTA performance on the zero-shot classification of healthy, mild cognitive impairment (MCI), and AD subjects, and shows its capability for medical visual question answering (VQA). The code and pre-trained models are available online: https://github.com/Qybc/MedBLIP.

Paper Structure (11 sections, 5 equations, 3 figures, 5 tables)

Introduction
Related Works
MedBLIP
Problem Formulation
Network Framework
MedQFormer
Training MedBLIP
Experiments
Datasets and Experimental Settings
Experimental Results
Discussion and Conclusion

Figures (3)

Figure 1: Architecture overview of our proposed MedBLIP, a CAD system designed for medical diagnosis with electronic health records via multimodal representation learning in a language model.
Figure 2: Illustration of our proposed MedQFormer, which aligns 3D visual and textual features for learning in the unified latent space of the language model.
Figure 3: Samples of zero-shot results on the AIBL dataset, which are generated by our MedBLIP built on BioMedLM with LoRA fine-tuning.
