MeDSLIP: Medical Dual-Stream Language-Image Pre-training with Pathology-Anatomy Semantic Alignment

Wenrui Fan; Mohammod N. I. Suvon; Shuo Zhou; Xianyuan Liu; Samer Alabed; Venet Osmani; Andrew J. Swift; Chen Chen; Haiping Lu

TL;DR

MeDSLIP tackles the entanglement of pathology and anatomy semantics in medical imaging by introducing a dual-stream framework that disentangles these semantics in both images and reports. It combines a disentanglement module, domain-informed text prompts, and an interaction modeling block with ProtoCL and ICL to capture cross-stream relationships, optimizing with $\mathcal{L} = \mathcal{L}_{Exist} + \alpha \mathcal{L}_{ProtoCL} + \beta \mathcal{L}_{ICL}$. Evaluated on chest X-ray benchmarks (NIH CXR14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-4), MeDSLIP achieves strong zero-shot and fine-tuning performance, including on unseen diseases such as COVID-19, and exhibits robust grounding and segmentation capabilities. Ablation studies confirm the contributions of disentanglement, ProtoCL, and ICL, and the authors provide public code and pre-trained weights to support deployment and further research.
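
As a minimal sketch of how the three terms of the objective in the TL;DR combine, the function below computes the weighted sum of an existence loss, a ProtoCL loss, and an ICL loss. The function name `total_loss` and the default weights are illustrative assumptions; the values of the two weights are not given in this summary.

```python
def total_loss(l_exist, l_protocl, l_icl, alpha=1.0, beta=1.0):
    """Weighted sum of the three loss terms.

    alpha and beta are the weights on the ProtoCL and ICL terms.
    The defaults of 1.0 are placeholders, not values from the paper.
    """
    return l_exist + alpha * l_protocl + beta * l_icl


if __name__ == "__main__":
    # Hypothetical per-batch loss values, for illustration only.
    print(total_loss(1.0, 2.0, 3.0, alpha=0.5, beta=0.1))
```

Setting either weight to zero drops the corresponding interaction term, which is one way to set up the kind of ablation the TL;DR mentions.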

Abstract

Pathology and anatomy are two essential groups of semantics in medical data. Pathology describes what the diseases are, while anatomy explains where the diseases occur. They describe diseases from different perspectives, providing complementary insights into diseases. Thus, properly understanding these semantics and their relationships can enhance medical vision-language models (VLMs). However, pathology and anatomy semantics are usually entangled in medical data, hindering VLMs from explicitly modeling these semantics and their relationships. To address this challenge, we propose MeDSLIP, a novel Medical Dual-Stream Language-Image Pre-training pipeline, to disentangle pathology and anatomy semantics and model the relationships between them. We introduce a dual-stream mechanism in MeDSLIP to explicitly disentangle medical semantics into pathology-relevant and anatomy-relevant streams and align visual and textual information within each stream. Furthermore, we propose an interaction modeling module with prototypical contrastive learning loss and intra-image contrastive learning loss to regularize the relationships between pathology and anatomy semantics. We apply MeDSLIP to chest X-ray analysis and conduct comprehensive evaluations with four benchmark datasets: NIH CXR14, RSNA Pneumonia, SIIM-ACR Pneumothorax, and COVIDx CXR-4. The results demonstrate MeDSLIP's superior generalizability and transferability across different scenarios. The code is available at https://github.com/Shef-AIRE/MeDSLIP, and the pre-trained model is released at https://huggingface.co/pykale/MeDSLIP.

Paper Structure (29 sections, 7 equations, 8 figures, 6 tables)

Introduction
Methodology
Text Processing
Image Encoding
Disentanglement Module
Semantic Vision-Language Alignment
Interaction Modeling
Prototypical Contrastive Loss (ProtoCL)
Intra-image Contrastive Loss (ICL)
Inference
Experiment Settings
Datasets
Implementation
Baselines
Metrics
...and 14 more sections

Figures (8)

Figure 1: Pipeline of Medical Dual-Stream Language-Image Pre-training (MeDSLIP). Each module is indicated with a unique color. Symbols with $\mathbf{I}$ and $\mathbf{T}$ denote image and text embeddings, respectively. $Q$ denotes query networks. $h$ denotes linear projection layers. $\mathbf{Z}$ represents outputs after linear projection. $E_I$ and $E_T$ are image and text encoders, respectively. $\mathbf{y}$ represents existence labels. Superscripts $p$ and $a$ mark pathology-related and anatomy-related symbols, respectively. a. Pipeline: Reports are processed to extract pathology and anatomy terms and to generate the text query embedding sets $\{\mathbf{T}^a\}_n$ and $\{\mathbf{T}^p\}_m$ and an existence label matrix $\mathbf{y}^{a,p}$. Here $m$ and $n$ are the numbers of the most common pathology and anatomy semantics selected across all medical reports. Images are encoded, disentangled, and aligned within the corresponding streams. The interaction modeling module regularizes the interactions between pathology and anatomy semantics. b. Text Processing: (pathology, anatomy, existence) triplets are extracted from raw reports. The most frequent triplets across all reports are used as query sets, which are prompted and encoded to obtain query embeddings (see Sec. Text Processing). c. Disentanglement Module: It masks raw image embeddings to disentangle pathology and anatomy embeddings (see Sec. Disentanglement Module). d. Semantic Alignment: A query network $Q^p$ aligns the text query set $\{\mathbf{T}^p\}_m$ with the image pathology embedding $\mathbf{I}^p$ and outputs a queried pathology embedding set $\{\mathbf{I}^p_q\}_m$. An existence predictor $p^p$ then checks whether each text semantic exists in the images. A similar alignment process is applied to anatomy semantics (see Sec. Semantic Vision-Language Alignment). e. Interaction Modeling: $\mathcal{L}_{ICL}$ aligns unimodal, cross-stream information, while $\mathcal{L}_{ProtoCL}$ aligns cross-modal, cross-stream information (see Sec. Interaction Modeling).
Figure 2: Comparison between contrastive learning with or without prototypes, using ProtoCL between anatomy image embeddings and pathology text embeddings as an example. a. Conventional contrastive learning without prototypes. b. ProtoCL uses the prototype of all positive samples as the new positive example in contrastive learning.
Figure 3: Mechanism of intra-image contrastive loss (ICL). $a_1, \dots, a_n$ and $p_1, \dots, p_m$ indicate the semantics in the query sets. When computing $\mathcal{L}_{ICL}$, elements of $\mathbf{\hat{y}}^{a,p}$ are the predictions and elements of $\mathbf{y}^{a,p}$ are the ground truths.
Figure 4: Disease-wise AUROCs of zero-shot classification on the NIH CXR14 dataset show that MeDSLIP outperforms the baselines on most diseases. AUROCs are calculated between the positive patients of each disease and the healthy controls across all data.
Figure 5: Disease-wise UMAPs of $\{\mathbf{I}_q^p\}_{14}$ on the NIH CXR14 dataset. Gray points in each UMAP represent $\mathbf{I}_q^p$ of healthy controls, while colored points denote $\mathbf{I}_q^p$ of patients with the corresponding disease. The diseases with higher AUROC scores in the zero-shot classification table tend to be more distinct and well-clustered.
...and 3 more figures
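
The idea in the Figure 2 caption, replacing the individual positive samples with a single prototype, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the prototype is taken as the mean of the positive embeddings, similarity is cosine, the loss has the InfoNCE form, and the temperature `tau` and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prototype(vectors):
    """Element-wise mean of a list of equal-length vectors (assumed prototype)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def proto_contrastive_loss(anchor, positives, negatives, tau=0.1):
    """InfoNCE-style loss with one prototype standing in for all positives.

    anchor    : e.g. an anatomy image embedding
    positives : e.g. pathology text embeddings paired with the anchor
    negatives : embeddings treated as not paired with the anchor
    tau       : temperature (illustrative value)
    """
    proto = prototype(positives)
    pos_logit = cosine(anchor, proto) / tau
    logits = [pos_logit] + [cosine(anchor, neg) / tau for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - pos_logit


if __name__ == "__main__":
    positives = [[1.0, 0.1], [1.0, -0.1]]
    negatives = [[0.0, 1.0]]
    aligned = proto_contrastive_loss([1.0, 0.0], positives, negatives)
    misaligned = proto_contrastive_loss([0.0, 1.0], positives, negatives)
    print(aligned < misaligned)
```

With a prototype, the anchor is compared against one positive target instead of many, so the loss is small when the anchor points toward the average positive and large when it points toward a negative.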
