MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval

Ahmad Elallaf, Yu Zhang, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang

TL;DR

MedProbCLIP tackles the reliability limitations of deterministic vision-language models in medical radiography by modeling image and report embeddings as Gaussian distributions, capturing uncertainty and many-to-many relationships. It introduces a probabilistic contrastive learning objective with a Gaussian distance measure and a variational information bottleneck, applied in a multi-view, multi-section training setup to learn distributions over two images and two reports. On MIMIC-CXR, MedProbCLIP achieves state-of-the-art retrieval and zero-shot classification performance, along with superior calibration and selective retrieval reliability, and demonstrates robustness to clinically relevant image perturbations. The work highlights the value of uncertainty-aware cross-modal representations for trustworthy radiology image-text retrieval and suggests future extensions to richer uncertainty structures and clinical decision-support integration.

Abstract

Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions. During training, MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding to provide fine-grained supervision for clinically aligned correspondence; at inference, it requires only a single radiograph and a single report. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
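
To make the idea of Gaussian embeddings concrete, the following is a minimal sketch, not the authors' implementation. It assumes diagonal-covariance Gaussians and uses the expected squared Euclidean distance between samples as the distance measure, a closed form of the kind used in PCME++-style models; the abstract does not state MedProbCLIP's exact formula. The function names, and the `scale` and `shift` parameters, are hypothetical.

```python
import math


def gaussian_distance(mu_a, var_a, mu_b, var_b):
    """Expected squared distance E||z_a - z_b||^2 for independent diagonal Gaussians.

    Assumption: this closed form stands in for the paper's Gaussian distance measure.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    # Larger variances (more uncertainty) push the two embeddings further apart.
    var_term = sum(va + vb for va, vb in zip(var_a, var_b))
    return mean_term + var_term


def match_probability(distance, scale=1.0, shift=0.0):
    """Map a distance to a match probability: sigmoid(-scale * distance + shift)."""
    return 1.0 / (1.0 + math.exp(scale * distance - shift))


def kl_to_standard_normal(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)), a common variational information bottleneck term."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


# Hypothetical 2-D embeddings: one image and one report.
image_mu, image_var = [0.0, 0.0], [1.0, 1.0]
report_mu, report_var = [3.0, 4.0], [0.5, 0.5]

d = gaussian_distance(image_mu, image_var, report_mu, report_var)
print(d, match_probability(d), kl_to_standard_normal(image_mu, image_var))
```

In this sketch the variance term is what separates a probabilistic embedding from a deterministic one: two pairs with identical means get different distances when one side is more uncertain. The KL term penalizes variances that collapse toward zero, which is one way a bottleneck can discourage overconfident embeddings.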

Paper Structure (33 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Illustration of inherent many-to-many relationships in cross-modal datasets. Although MS-COCO annotates only a single caption as the positive match to one image (blue arrows), human raters often identify multiple additional plausible matches (pink dashed arrows). Such unannotated positives create false negatives that violate the one-to-one assumption commonly enforced in contrastive learning, motivating models capable of handling ambiguity and uncertainty in image-text alignment.
  • Figure 2: Example of a contrastive learning model.
  • Figure 3: Deterministic embedding vs. probabilistic embedding.
  • Figure 4: Overview of the MedProbCLIP architecture.
  • Figure 5: Selective retrieval performance demonstrates MedProbCLIP’s superior calibration and potential for safer clinical deployment.
  • ...and 1 more figures
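
The selective retrieval evaluation named in Figure 5 can be illustrated with a risk-coverage curve. The sketch below is a generic construction under stated assumptions, not the paper's evaluation code: each query is assumed to carry a scalar uncertainty score (for example, derived from its embedding variance) and a flag saying whether its top retrieval was correct. The function name and the toy data are hypothetical.

```python
def risk_coverage_curve(uncertainties, correct):
    """Return (coverage, risk) points when answering queries from least to most uncertain.

    coverage = fraction of queries answered; risk = error rate among answered queries.
    """
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    curve = []
    errors = 0
    for kept, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((kept / len(order), errors / kept))
    return curve


# Toy example: the single wrong retrieval is also the most uncertain query.
uncertainties = [0.1, 0.9, 0.4, 0.2]
correct = [True, False, True, True]
for coverage, risk in risk_coverage_curve(uncertainties, correct):
    print(f"coverage={coverage:.2f} risk={risk:.2f}")
```

When uncertainty is well calibrated, risk stays low at partial coverage and rises only as the most uncertain queries are included, which is the behavior a system that abstains on uncertain cases relies on.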