Improving Medical Multi-modal Contrastive Learning with Expert Annotations

Yogesh Kumar, Pekka Marttinen

TL;DR

Medical multimodal understanding is hampered by scarce expert-annotated data and a modality gap between image and text embeddings. The authors propose eCLIP, an expert-annotated extension of CLIP that injects radiologist eye-gaze heatmaps through a heatmap processor, plus mixup and curriculum strategies, while preserving core CLIP architecture. Across zero-shot classification, linear probing, cross-modal retrieval, and RAG-based radiology report generation, eCLIP yields improved alignment, uniformity, and reduced modality gap, translating into stronger cross-modal representations. The work demonstrates that high-quality expert annotations can substantially boost medical imaging multimodal learning and opens avenues for sequence-aware and text-side annotation extensions.

Abstract

We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to efficiently utilize the scarce expert annotations, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications of the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP showcases consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
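
The embedding-quality terms in the abstract (alignment, uniformity, modality gap) can be made concrete with a small sketch. The definitions below follow the commonly used formulations: mean squared distance between paired embeddings, log of the mean Gaussian potential over all pairs, and distance between modality centroids. This summary does not confirm that the paper computes them in exactly this way, so the function names and the default `t=2.0` are assumptions for illustration only.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(images, texts):
    """Mean squared distance between paired image/text embeddings (lower is better)."""
    return sum(sq_dist(v, t) for v, t in zip(images, texts)) / len(images)

def uniformity(embs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is more uniform)."""
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    mean_pot = sum(math.exp(-t * sq_dist(embs[i], embs[j])) for i, j in pairs) / len(pairs)
    return math.log(mean_pot)

def modality_gap(images, texts):
    """Euclidean distance between the centroids of the two modalities."""
    dim = len(images[0])
    c_img = [sum(v[d] for v in images) / len(images) for d in range(dim)]
    c_txt = [sum(u[d] for u in texts) / len(texts) for d in range(dim)]
    return math.sqrt(sq_dist(c_img, c_txt))
```

All three functions expect lists of unit-normalized embedding vectors. Under these definitions, a model with well-matched pairs and a small modality gap scores low on `alignment` and `modality_gap`, while embeddings spread evenly over the hypersphere score low on `uniformity`.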

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Analysis of CLIP Embeddings in Medical Imaging. The figure presents embeddings generated by a CLIP model, pretrained on an internet-scale dataset, applied to the Open-I dataset pairing X-rays with corresponding radiology reports.
  • Figure 1: Sample Efficiency. (top row) Zero-shot performance on three multi-label classification test sets for DACL and eCLIP Swin Tiny models, trained with varying numbers of training batches. (bottom row) Linear probe scores with varying amounts of training data for $m^3$-mixup and eCLIP Swin Tiny models.
  • Figure 2: eCLIP Pretraining with Expert Annotations. eCLIP adds a Heatmap Processor (right), featuring a multi-headed attention layer, to the standard Image and Text encoders in CLIP. This processor, along with the vision and text encoders, maps inputs into a shared hypersphere. Here, the original image ($I_i$), its text ($T_i$) and the heatmap-processed image ($I^E_i$) are positioned within a tripartite area (shown here after 2D UMAP projection; please refer to the Supplement for a scaled version). We employ mixup between $I_i$ and $I^E_i$ to generate the embedding $v^{\lambda}_i$, which gives us additional positive pairs to enhance the CLIP InfoNCE loss optimization. An auxiliary loss, $\mathcal{L}_{\text{priming}}$, is used during the initial training steps to "prime" the heatmap processor to imitate an identity function when the heatmap is composed of all ones.
  • Figure 2: 2D UMAP Projection of Embeddings. The figure shows the UMAP projection of the image, text, and heatmap-processed image embeddings generated by eCLIP with a Swin Tiny encoder. We use the Open-I dataset for images and text; since expert annotations are unavailable for this dataset, we generate random uniform masks to simulate heatmaps.
  • Figure 3: Comparing eCLIP with $m^2$-mixup [oh2024geodesic]. (left) Standard CLIP, showing image-text positive pairs $(v_i, t_i)$ (solid line), while the other image embeddings serve as negative pairs (dashed line). (center) $m^2$-mixup creates negative pairs $(v^{\lambda}_j, t_i)$ via interpolation between embeddings along the geodesic. (right) eCLIP adds an expert image embedding, $v^E_i$, in addition to $v_i$ for text $t_i$, forming additional positive and negative pairs.
  • ...and 3 more figures
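
The Figure 2 and Figure 3 captions describe mixing the image embedding $v_i$ with the expert embedding $v^E_i$ to obtain $v^{\lambda}_i$, which then serves as an extra positive for text $t_i$ in the InfoNCE loss. A minimal sketch of that idea follows. It is not the authors' implementation: the linear interpolation followed by renormalization, the temperature of 0.07, the equal weighting of the two loss terms, and all function names are assumptions made for illustration, and the priming loss is omitted because its form is not given here.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mixup(v_img, v_expert, lam):
    """Interpolate two embeddings linearly, then project back onto the unit sphere."""
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(v_img, v_expert)]
    return normalize(mixed)

def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[i] is the positive for keys[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

def clip_loss(images, texts, temperature=0.07):
    """Symmetric CLIP loss: average of image-to-text and text-to-image InfoNCE."""
    return 0.5 * (info_nce(images, texts, temperature)
                  + info_nce(texts, images, temperature))

def eclip_style_loss(images, experts, texts, lam, temperature=0.07):
    """Standard CLIP loss plus the same loss on mixed embeddings (assumed equal weights)."""
    mixed = [mixup(v, e, lam) for v, e in zip(images, experts)]
    return clip_loss(images, texts, temperature) + clip_loss(mixed, texts, temperature)
```

For example, `eclip_style_loss(images, experts, texts, lam=0.5)` takes three equally sized lists of unit-normalized vectors and returns a scalar. In practice `lam` would be sampled per batch, as is usual for mixup; how the paper samples it is not stated in this summary.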