MEDBind: Unifying Language and Multimodal Medical Data Embeddings

Yuan Gao; Sangwook Kim; David E Austin; Chris McIntosh

TL;DR

MEDBind learns joint embeddings across CXR, ECG, and medical text. It features tri-modality binding and can improve downstream tasks by directly integrating CXR and ECG embeddings into a large language model for multimodal prompt tuning.

Abstract

Medical vision-language pretraining models (VLPM) have achieved remarkable progress in fusing chest X-rays (CXR) with clinical texts, introducing image-text data binding approaches that enable zero-shot learning and downstream clinical tasks. However, the current landscape lacks the holistic integration of additional medical modalities, such as electrocardiograms (ECG). We present MEDBind (Medical Electronic patient recorD), which learns joint embeddings across CXR, ECG, and medical text. Using text data as the central anchor, MEDBind features tri-modality binding, delivering competitive performance in top-K retrieval, zero-shot, and few-shot benchmarks against established VLPM, as well as the ability to perform CXR-to-ECG zero-shot classification and retrieval. This integration is achieved by combining a contrastive loss on modality-text pairs with our proposed contrastive loss function, Edge-Modality Contrastive Loss, fostering a cohesive embedding space for CXR, ECG, and text. Finally, we demonstrate that MEDBind can improve downstream tasks by directly integrating CXR and ECG embeddings into a large language model for multimodal prompt tuning.
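
The abstract describes binding each modality to text through a contrastive loss on modality-text pairs. As a rough illustration only, the sketch below implements a generic CLIP-style symmetric contrastive (InfoNCE) loss in NumPy. It is not the authors' implementation and does not reproduce the Edge-Modality Contrastive Loss, whose exact form is not given here. The function names, the temperature of 0.07, and the assumption that row i of each matrix forms a positive pair are all choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    Assumption: row i of modality_emb and row i of text_emb are a
    positive pair; every other row in the batch acts as a negative.
    """
    a = l2_normalize(modality_emb)
    b = l2_normalize(text_emb)
    logits = a @ b.T / temperature  # (n, n) scaled cosine similarities
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Toy usage with random stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
cxr_aligned = text + 0.01 * rng.normal(size=(8, 16))
cxr_shuffled = cxr_aligned[::-1]  # breaks every pairing
loss_aligned = symmetric_contrastive_loss(cxr_aligned, text)
loss_shuffled = symmetric_contrastive_loss(cxr_shuffled, text)
```

In this toy setup the aligned batch yields a much lower loss than the shuffled one, which is the behavior a contrastive objective relies on to pull paired embeddings together.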

Paper Structure (12 sections, 2 equations, 4 figures, 8 tables)

Introduction
Methods and Materials
Model Architecture
Loss function
ECG-CLIP and Tri-modality Evaluations
Experiments and Results
Datasets
Modality-to-Text and Cross-Modality Retrieval
Zero/Few-shot and Cross-Modality Classification
Multimodal LLM Integration
Conclusion
Appendix

Figures (4)

Figure 1: Proposed method. Batch size $n$: CXR (green), ECG (purple), and paired text (blue). Subset size $m$: paired ECG/CXR. Inputs are embedded and normalized ($\blacktriangleright$). We used two losses: 1) Text-Modality Contrastive Loss (TMCL); 2) Edge-Modality Contrastive Loss (EMCL). Grey is positive-pair; light grey is additional related pairs.
Figure 2: Embedding visualization and CXR-to-ECG cross-modality retrieval. (Left) t-SNE plots of CXR and ECG embeddings for various models. (Right) Cross-modality retrieval Top-K recall. MEDBind$_{BD}$ brings CXR and ECG clusters closer in t-SNE and achieves the highest cross-modality recall@{1,5,10}. $^*$CXR VLPM with ECG-CLIP as encoder zoo.
Figure 3: Results of zero-shot (denoted by asterisks (*) on the y-axis) and few-shot (K={1,2,4,8,16}) classification using balanced accuracy (%) on CXR (green): COVID and RSNA datasets, and ECG (purple): PTB-XL and ICBEB datasets.
Figure A.1: Three different training paradigms for downstream LLM tasks. 1) Text-only: the traditional method of prompt tuning, using LoRA (Hu et al., 2021) to tune the weights of BioBERT (Lee et al., 2020). 2) Encoder Zoo: the AnyMAL (Moon et al., 2023) paradigm for fine-tuning, which incorporates multiple modalities by inputting CXR and ECG tokens alongside clinical text; the tokens are generated either from CXR-CLIP (You et al., 2023) and ECG-CLIP or from MedCLIP (Wang et al., 2022) and ECG-CLIP. 3) MEDBind: a unified model for multimodal binding.
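
The Top-K recall reported for cross-modality retrieval in Figure 2 can be illustrated with a small NumPy sketch. This is a generic recall@K computation under stated assumptions, not the paper's evaluation code: the function name `recall_at_k` is hypothetical, and the example assumes query i has exactly one correct match, gallery item i.

```python
import numpy as np


def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose true match is among the k nearest gallery items.

    Assumption: query i's only correct match is gallery item i, and
    similarity is cosine similarity between embeddings.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # (n_queries, n_gallery)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar items
    targets = np.arange(len(q))[:, None]
    return float((top_k == targets).any(axis=1).mean())


# Toy usage with random stand-ins for CXR and ECG embeddings (hypothetical data).
rng = np.random.default_rng(0)
ecg = rng.normal(size=(20, 32))
cxr = ecg + 0.5 * rng.normal(size=(20, 32))  # noisy but correlated pairs
recalls = {k: recall_at_k(cxr, ecg, k) for k in (1, 5, 10)}
```

Because the top-K set only grows with K, recall@K never decreases as K increases, which is why results are typically reported at several values such as 1, 5, and 10.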
