ProbMed: A Probabilistic Framework for Medical Multimodal Binding

Yuan Gao, Sangwook Kim, Jianzhong You, Chris McIntosh

TL;DR

ProbMED is a probabilistic framework for binding four medical modalities: chest X-ray (CXR), electrocardiogram (ECG), echocardiogram (ECHO), and clinical text. It represents each modality's embedding as a Gaussian distribution $Z_m \sim \mathcal{N}(\mu_m, \operatorname{diag}(\sigma_m^2))$ and aligns these distributions in a shared embedding space via probabilistic contrastive learning. The method combines three components: an InfoNCE objective with a Hellinger distance-based similarity $PS(q_n,k_t)=1-\sqrt{H^2(q_n,k_t)}$, a Synthetic Instance Sampling (SIS) loss to strengthen within-modality binding, and a Variational Information Bottleneck to prevent variance collapse. The final loss combines these components across randomly sampled modality pairs. Experiments on 13 medical datasets show that ProbMED outperforms current Med-VLPMs in cross-modality retrieval, zero-shot classification, and few-shot classification, and that integrating multiple modalities improves prognostication. The study also highlights practical benefits of probabilistic modeling, such as uncertainty-based prompt filtering and distribution-based sampling for data-scarce scenarios. Overall, ProbMED advances medical multimodal learning by explicitly modeling uncertainty and many-to-many modality mappings, which supports robust, interpretable cross-modal reasoning in real-world healthcare settings.
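
To make the similarity $PS(q_n,k_t)=1-\sqrt{H^2(q_n,k_t)}$ concrete, the sketch below evaluates it for two diagonal Gaussians using the standard closed form of the squared Hellinger distance, $H^2 = 1 - BC$, where $BC$ is the Bhattacharyya coefficient. The function names (`hellinger_sq`, `prob_similarity`) and the plain-list inputs are illustrative assumptions, not the authors' code, and the paper may compute or approximate $H^2$ differently.

```python
import math

def hellinger_sq(mu_q, var_q, mu_k, var_k):
    """Squared Hellinger distance between two diagonal Gaussians.

    Inputs are equal-length lists of per-dimension means and variances
    (hypothetical interface, chosen for illustration).
    """
    log_bc = 0.0  # log of the Bhattacharyya coefficient, summed over dimensions
    for mq, vq, mk, vk in zip(mu_q, var_q, mu_k, var_k):
        s = vq + vk
        # Variance-mismatch term: sqrt(2 * sigma_q * sigma_k / (var_q + var_k))
        log_bc += 0.5 * math.log(2.0 * math.sqrt(vq * vk) / s)
        # Mean-separation term: exp(-(mu_q - mu_k)^2 / (4 * (var_q + var_k)))
        log_bc -= 0.25 * (mq - mk) ** 2 / s
    return 1.0 - math.exp(log_bc)

def prob_similarity(mu_q, var_q, mu_k, var_k):
    """PS = 1 - sqrt(H^2); equals 1 for identical distributions."""
    h2 = max(hellinger_sq(mu_q, var_q, mu_k, var_k), 0.0)  # guard against rounding
    return 1.0 - math.sqrt(h2)

# Same mean, different spread: similarity drops below 1 even with no mean shift.
print(prob_similarity([0.0], [1.0], [0.0], [4.0]))
```

Unlike a cosine similarity on point embeddings, this score depends on the variances as well as the means, so two embeddings with identical means but different uncertainty are not treated as a perfect match.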

Abstract

Medical decision-making requires integrating diverse medical information, from imaging to clinical narratives. These medical modalities are often acquired in a many-to-many manner. However, current medical vision-language pretraining models (Med-VLPMs) fail to directly account for this many-to-many mapping in their model training and embeddings. To address this, we present Probabilistic Modality-Enhanced Diagnosis (ProbMED), a multimodal Med-VLPM that employs probabilistic contrastive learning to model distributions over embeddings rather than deterministic estimates. ProbMED aligns four distinct modalities -- chest X-rays, electrocardiograms, echocardiograms, and clinical text -- into a unified probabilistic embedding space. We use InfoNCE loss with Hellinger distance to integrate inter-modality distributions. We introduce a probabilistic synthetic sampling loss that captures modality-specific mean and variance to improve intra-modality binding. Extensive experiments across 13 medical datasets demonstrate that our model outperforms current Med-VLPMs in cross-modality retrieval, zero-shot, and few-shot classification. We also demonstrate the robust integration of multiple modalities for prognostication, showing improved intra- and inter-medical modality binding.
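
As a rough illustration of the InfoNCE objective mentioned above, the sketch below computes a symmetric InfoNCE loss from a precomputed $N \times N$ similarity matrix, where entry $(i, j)$ scores sample $i$ of one modality against sample $j$ of another and the diagonal holds the matched pairs. In ProbMED the entries would come from the Hellinger-based similarity; here the matrix is simply given. The temperature `tau`, the symmetric averaging, and the function names are assumptions made for illustration, not details confirmed by the abstract.

```python
import math

def _row_loss(sim, tau):
    """Mean over rows i of -log softmax(sim[i] / tau)[i]."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)

def info_nce(sim, tau=0.07):
    """Symmetric InfoNCE over a square similarity matrix (list of lists)."""
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the reverse direction
    return 0.5 * (_row_loss(sim, tau) + _row_loss(sim_t, tau))

aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative = [[0.5] * 3 for _ in range(3)]
print(info_nce(aligned), info_nce(uninformative))
```

When every pair scores the same, the loss equals $\log N$, the value for a uniform guess over $N$ candidates. It approaches zero as matched pairs dominate their rows and columns.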

Paper Structure

This paper contains 40 sections, 18 equations, 6 figures, 14 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of ProbMED. (Left) (A) Text-modality: Given multiple medical modalities, we align each with its corresponding textual description using a contrastive learning framework. Traditional deterministic approaches represent each modality-text pair as fixed points in latent space ($z_n$ and $z_t$), whereas probabilistic approaches model each embedding as a Gaussian distribution ($Z_n$ and $Z_t$), detailed in $\S$\ref{sec:bindingmodal}. (B) Non-text modality: We also bind non-text modalities to one another and model overlapping related modalities, where $n$ and $n'$ are different non-text modalities ($\S$\ref{sec:bindingmodal}). (C) Within-modality: We introduce a Synthetic Instance Sampling loss for improved within-modality binding. Given any modality $Z_m$, we use the learned distribution to draw additional samples $z^{l}$ and $z^{l'}$, detailed in $\S$\ref{sec:intramodal}. (Right) ProbMED links ECG, CXR, ECHO, and text into a unified probabilistic embedding space. We trained on all text-modality pairs and on ECG-CXR (non-text modality) pairs (black solid lines), and we observe emergent alignment (grey dashed lines) after training.
  • Figure 2: Qualitative comparison of decision boundaries on Kaggle COVID. Each plot shows data projected onto the same PCA components with an SVM (radial kernel) decision boundary. (A) The boundary is less reliable in a 2-way 3-shot few-shot setting. (B) ProbMED's probabilistic sampling (X-marks indicate sampled points) improves the boundary with more samples. (C) Using the full dataset. Note: two green points overlap in the 3-shot scenario.
  • Figure 3: Probabilistic sampling in ProbMED improves few-shot performance. (A–C) Our probabilistic feature sampling strategy. ProbMED extracts a probability distribution $Z \sim \mathcal{N}(\mu, \sigma^2)$ for each input. We use the reparameterization trick to sample $16$ distinct feature embeddings, capturing latent uncertainty. (Bottom) Few-shot AUROC comparison across 8 datasets: sampling-based embeddings (colored lines) versus using only the $\mu$ embedding (grey lines). For reference, the grey lines are the ProbMED results in Tab. \ref{tab:cxrfewshot}, \ref{tab:ecgfewshot}, and \ref{tab:echofewshot}.
  • Figure 4: ProbMED model architecture. Encoders follow standard models. The proposed method of extracting $\mu$ and $\log(\sigma^2)$ follows PCME++ \cite{chun2023improvedpcmepp}. BN denotes BatchNorm1d and GPO is the Generalized Pooling Operator \cite{chen2021learningGPO}.
  • Figure 5: Qualitative visualization of ProbMED embeddings using PCA for dimensionality reduction. We plot a CXR embedding (depicting pneumonia, sampled from the RSNA Pneumonia dataset) alongside two distinct TXT embeddings, "Cloudy patch in the left lung" and "CXR has pneumonia". In the diagram, the probabilistic embeddings show the distribution of each embedding, whereas the deterministic embeddings offer only a limited interpretation of the ambiguity. Best viewed in color.
  • ...and 1 more figure
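
The Figure 3 caption describes drawing 16 feature embeddings per input with the reparameterization trick. A minimal sketch of that step is given below, assuming the encoder outputs a mean vector and a log-variance vector (as the Figure 4 caption suggests). The function name `sample_embeddings`, the list-based interface, and the fixed seed are hypothetical choices for illustration; how the sampled embeddings are then used by the few-shot classifier is not shown.

```python
import math
import random

def sample_embeddings(mu, log_var, n_samples=16, seed=0):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick).

    mu and log_var are equal-length lists; log_var holds log(sigma^2).
    """
    rng = random.Random(seed)  # seeded generator so the draw is reproducible
    sigma = [math.exp(0.5 * lv) for lv in log_var]  # sigma = exp(log(sigma^2) / 2)
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(n_samples)
    ]

# One input's distribution expands into 16 embeddings scattered around its mean.
samples = sample_embeddings([0.2, -1.0, 0.5], [-2.0, -2.0, -4.0])
print(len(samples), len(samples[0]))
```

Dimensions with larger predicted variance spread the samples further from $\mu$, so the augmented set reflects the model's per-dimension uncertainty rather than adding uniform noise.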