Closing the gap in multimodal medical representation alignment

Eleonora Grassucci, Giordano Cicchetti, Danilo Comminiello

TL;DR

The paper tackles the modality gap in multimodal medical representation learning, showing that standard CLIP losses leave true radiology image-text pairs poorly aligned (a cosine similarity of about $0.20$, corresponding to an angle of roughly $80^\circ$). It introduces a modality-agnostic framework with an Align True Pairs loss $\mathcal{L}_{ATP}$ and a Centroid Uniformity loss $\mathcal{L}_{CU}$, which are combined with a CLIP-style objective to pull semantically related representations together across modalities. On ROCO, using EVAClip-ViT-G for images and BERT-B for text with a 512-d latent space, the approach reduces the modality gap to $0.12$ and increases Cos True Pairs to $0.54$, while boosting Recall@10 by $7.4$ points and improving captioning metrics, demonstrating stronger cross-modal alignment and downstream performance. The method offers a practical path toward reliable, semantically consistent multimodal medical representations suitable for retrieval and captioning, with potential for extension to other medical modalities.
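To make the alignment numbers above concrete, here is a minimal sketch of how such statistics could be computed from paired embeddings. It is an illustration, not the paper's evaluation code: the function names are hypothetical, and the gap is assumed to be the Euclidean distance between the two modality centroids (a common definition in the modality-gap literature), since the summary does not state the exact formula.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def alignment_stats(img_emb, txt_emb):
    """Return (mean true-pair cosine, its angle in degrees, centroid gap).

    Row i of img_emb and row i of txt_emb are assumed to form a true pair.
    The centroid-distance gap is an ASSUMED definition, not taken from the paper.
    """
    img = l2_normalize(np.asarray(img_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(txt_emb, dtype=np.float64))
    # Cosine similarity of each true pair, averaged over the batch.
    cos_true = float(np.mean(np.sum(img * txt, axis=1)))
    # Angle corresponding to the mean cosine (not the mean of per-pair angles).
    angle_deg = float(np.degrees(np.arccos(np.clip(cos_true, -1.0, 1.0))))
    # Distance between the centroids of the two modalities.
    gap = float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))
    return cos_true, angle_deg, gap
```

For reference, $\arccos(0.20) \approx 78.5^\circ$, which matches the "roughly $80^\circ$" figure quoted for standard CLIP training, and $\arccos(0.54) \approx 57.3^\circ$ for the reported Cos True Pairs after applying the proposed losses.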

Abstract

In multimodal learning, CLIP has emerged as the de facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unknown and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study this phenomenon in the latter case, revealing that the modality gap is also present in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.
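As background for the contrastive objective the abstract refers to, the following is a minimal NumPy sketch of a symmetric CLIP-style (InfoNCE) loss. It is a generic illustration under stated assumptions, not the paper's training code: the function names are hypothetical, and the temperature of 0.07 is an assumed default rather than a value taken from the paper.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of img_emb and row i of txt_emb form the true pair; every other
    combination in the batch acts as a negative. temperature is an ASSUMED value.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = img @ txt.T / temperature
    diag = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the matching column.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should put its mass on the matching row.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the softmax depends only on differences between logits, this loss rewards a true pair for scoring higher than the other pairs in the batch, not for reaching a high absolute cosine similarity. That is consistent with the observation that CLIP-trained true pairs can remain far apart in the latent space.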

Paper Structure

This paper contains 8 sections, 11 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Illustration of the modality gap in the medical imaging domain. Triangles represent embeddings extracted by the image encoder, while diamonds represent embeddings from the text encoder. Colors indicate shared semantic meaning. In standard CLIP-based training, the resulting latent space shows a significant modality gap (i.e., embeddings from different modalities with the same meaning remain far apart). Our method introduces additional loss functions designed to reduce this gap and to align cross-modal embeddings based purely on their semantic meaning.
  • Figure 2: The modality gap originates at initialization, with the two modalities clearly clustered. This gap persists even after training in the conventional CLIP-based learning setting, such as MedCLIP (Wang et al., 2022). In contrast, the proposed method closes the gap, leveraging the whole space to cluster embeddings according to their semantics rather than their modality type.