MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report

Samrajya Thapa, Koushik Howlader, Subhankar Bhattacharjee, Wei Le

TL;DR

MoRE is a novel multi-modal contrastive pre-training framework that combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. To the best of the authors' knowledge, it is the first integrated model to combine these three modalities with this approach.

Abstract

In this paper, we introduce a novel Multi-Modal Contrastive Pre-training Framework that synergistically combines X-rays, electrocardiograms (ECGs), and radiology/cardiology reports. Our approach leverages transformers to encode these diverse modalities into a unified representation space, aiming to enhance diagnostic accuracy and facilitate comprehensive patient assessments. We use LoRA-PEFT to significantly reduce the number of trainable parameters in the LLM and incorporate a recent linear attention-dropping strategy in the Vision Transformer (ViT) for smoother attention. Furthermore, we provide novel multimodal attention explanations and retrieval for our model. To the best of our knowledge, we are the first to propose an integrated model that combines X-rays, ECGs, and radiology/cardiology reports with this approach. By using a contrastive loss, MoRE aligns modality-specific features into a coherent embedding, which supports downstream tasks such as zero-shot classification and multimodal retrieval. With the proposed methodology, we achieve state-of-the-art (SOTA) results on the MIMIC-IV, CheXpert, Edema Severity, and PTB-XL downstream datasets, surpassing existing multimodal approaches. The framework shows significant improvements in capturing intricate inter-modal relationships and is robust in medical diagnosis, which establishes a framework for future research in multimodal learning in the healthcare sector.
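
The contrastive alignment described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the symmetric InfoNCE form, the temperature value, the choice to pair each of the X-ray and ECG embeddings with the joint report embedding, and all function names are assumptions made for illustration. The encoders themselves are not shown; random vectors stand in for their outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` and row i of `b` are a positive pair.

    The temperature of 0.07 is an assumed default, not a value from the paper.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def tri_modal_loss(xray, ecg, text, temperature=0.07):
    """Assumed combination: align X-ray and ECG each with the report embedding."""
    return (symmetric_info_nce(xray, text, temperature)
            + symmetric_info_nce(ecg, text, temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(8, 16))                  # stand-in report embeddings
    xray = text + 0.01 * rng.normal(size=(8, 16))    # nearly aligned X-ray embeddings
    ecg = text + 0.01 * rng.normal(size=(8, 16))     # nearly aligned ECG embeddings
    print("aligned loss:  ", tri_modal_loss(xray, ecg, text))
    print("misaligned loss:", tri_modal_loss(xray[::-1], ecg[::-1], text))
```

In this toy setup the loss is small when matching rows belong together and grows when the pairing is broken, which is the signal a contrastive objective uses to pull paired modalities into one embedding space.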

Paper Structure

This paper contains 30 sections, 4 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Multimodal pre-training framework. We join the diagnostic reports of both modalities into a single input and align the modalities with a contrastive loss. We employ the DropToken algorithm in our ViT encoders and a custom patch embedding for the ECG signal modality. The LLM is fine-tuned only with LoRA PEFT, which trains 0.6% of its total parameters.
  • Figure 2: t-SNE plots of X-ray features from the CheXpert (top) and MIMIC (bottom) datasets for models (a) MoRE, (b) GLoRIA, and (c) MedKLIP.
  • Figure 3: Retrieved X-ray for Query 1: "Cardiomegaly is severe"
  • Figure 4: Retrieved X-ray for Query 2: "There is presence of Edema and Effusion"
  • Figure 5: X-ray-ECG retrieval with the original associated text
  • ...and 3 more figures
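
The retrieval results in Figures 3 to 5 rely on a shared embedding space. One common way to run such retrieval, sketched below, is to rank gallery items by cosine similarity to the query embedding. This is an assumption about the general technique rather than a description of the paper's pipeline: `top_k_retrieval` is a hypothetical helper, and the toy vectors stand in for embeddings that the text, X-ray, or ECG encoders would produce.

```python
import numpy as np


def top_k_retrieval(query_emb, gallery_embs, k=3, eps=1e-12):
    """Return indices and cosine scores of the k gallery rows closest to the query."""
    q = query_emb / (np.linalg.norm(query_emb) + eps)
    g = gallery_embs / (np.linalg.norm(gallery_embs, axis=1, keepdims=True) + eps)
    scores = g @ q                        # cosine similarity per gallery item
    order = np.argsort(-scores)[:k]       # highest similarity first
    return order, scores[order]


if __name__ == "__main__":
    # Toy gallery of four "X-ray" embeddings; values are illustrative only.
    gallery = np.array([
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.9, 0.1, 0.0],
    ])
    query = np.array([1.0, 0.0, 0.0])     # stand-in for an encoded text query
    indices, scores = top_k_retrieval(query, gallery, k=2)
    print(indices, scores)
```

The same function works in any direction (text to X-ray, X-ray to ECG, and so on), provided all embeddings come from encoders trained into the same space.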