MeDSLIP: Medical Dual-Stream Language-Image Pre-training with Pathology-Anatomy Semantic Alignment
Wenrui Fan, Mohammod N. I. Suvon, Shuo Zhou, Xianyuan Liu, Samer Alabed, Venet Osmani, Andrew J. Swift, Chen Chen, Haiping Lu
TL;DR
MeDSLIP tackles the entanglement of pathology and anatomy semantics in medical imaging by introducing a dual-stream framework that disentangles these semantics in both images and reports. It combines a disentanglement module, domain-informed text prompts, and an interaction modeling block that uses a prototypical contrastive learning loss (ProtoCL) and an intra-image contrastive learning loss (ICL) to capture cross-stream relationships, optimizing $L = L_{Exist} + \boldsymbol{\alpha} L_{ProtoCL} + \boldsymbol{\beta} L_{ICL}$. Evaluated on chest X-ray benchmarks (NIH CXR14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-4), MeDSLIP achieves strong zero-shot and fine-tuning performance, including on unseen diseases such as COVID-19, and exhibits robust grounding and segmentation capabilities. Ablation studies confirm the contributions of disentanglement, ProtoCL, and ICL, and the authors provide public code and pre-trained weights to support deployment and further research.
Abstract
Pathology and anatomy are two essential groups of semantics in medical data. Pathology describes what the diseases are, while anatomy explains where the diseases occur. They describe diseases from different perspectives, providing complementary insights into diseases. Thus, properly understanding these semantics and their relationships can enhance medical vision-language models (VLMs). However, pathology and anatomy semantics are usually entangled in medical data, hindering VLMs from explicitly modeling these semantics and their relationships. To address this challenge, we propose MeDSLIP, a novel Medical Dual-Stream Language-Image Pre-training pipeline, to disentangle pathology and anatomy semantics and model the relationships between them. We introduce a dual-stream mechanism in MeDSLIP to explicitly disentangle medical semantics into pathology-relevant and anatomy-relevant streams and align visual and textual information within each stream. Furthermore, we propose an interaction modeling module with prototypical contrastive learning loss and intra-image contrastive learning loss to regularize the relationships between pathology and anatomy semantics. We apply MeDSLIP to chest X-ray analysis and conduct comprehensive evaluations with four benchmark datasets: NIH CXR14, RSNA Pneumonia, SIIM-ACR Pneumothorax, and COVIDx CXR-4. The results demonstrate MeDSLIP's superior generalizability and transferability across different scenarios. The code is available at https://github.com/Shef-AIRE/MeDSLIP, and the pre-trained model is released at https://huggingface.co/pykale/MeDSLIP.
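As a reading aid, the combined objective stated in the TL;DR (an existence term plus weighted ProtoCL and ICL terms) can be sketched as a plain weighted sum. This is a minimal illustration only, not the released implementation: the function name `total_loss`, its argument names, and the numeric values are hypothetical, and the individual loss terms are assumed to be already computed as scalars.

```python
def total_loss(l_exist, l_protocl, l_icl, alpha, beta):
    """Weighted sum L = L_Exist + alpha * L_ProtoCL + beta * L_ICL.

    All arguments are scalars. How each term is computed is outside
    the scope of this sketch (assumption: they are given).
    """
    return l_exist + alpha * l_protocl + beta * l_icl


# Hypothetical per-batch loss values and weights, chosen only for illustration.
example = total_loss(l_exist=0.5, l_protocl=0.25, l_icl=0.125, alpha=2.0, beta=4.0)
print(example)
```

Setting `alpha` or `beta` to zero removes the corresponding term, which is one simple way to picture the kind of ablation the TL;DR mentions for ProtoCL and ICL.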
