Improving Medical Multi-modal Contrastive Learning with Expert Annotations
Yogesh Kumar, Pekka Marttinen
TL;DR
Medical multimodal understanding is hampered by scarce expert-annotated data and by a modality gap between image and text embeddings. The authors propose eCLIP, an extension of CLIP that injects radiologist eye-gaze heatmaps through a heatmap processor and adds mixup and curriculum strategies, while leaving the core CLIP architecture unchanged. Across zero-shot classification, linear probing, cross-modal retrieval, and RAG-based radiology report generation, eCLIP improves alignment and uniformity and reduces the modality gap, which translates into stronger cross-modal representations. The work demonstrates that high-quality expert annotations can substantially boost multimodal learning in medical imaging, and it opens avenues for sequence-aware and text-side annotation extensions.
Abstract
We introduce eCLIP, an enhanced version of the CLIP model that integrates expert annotations in the form of radiologist eye-gaze heatmaps. It tackles key challenges in contrastive multi-modal medical imaging analysis, notably data scarcity and the "modality gap" -- a significant disparity between image and text embeddings that diminishes the quality of representations and hampers cross-modal interoperability. eCLIP integrates a heatmap processor and leverages mixup augmentation to use the scarce expert annotations efficiently, thus boosting the model's learning effectiveness. eCLIP is designed to be generally applicable to any variant of CLIP without requiring any modifications to the core architecture. Through detailed evaluations across several tasks, including zero-shot inference, linear probing, cross-modal retrieval, and Retrieval-Augmented Generation (RAG) of radiology reports using a frozen Large Language Model, eCLIP shows consistent improvements in embedding quality. The outcomes reveal enhanced alignment and uniformity, affirming eCLIP's capability to harness high-quality annotations for enriched multi-modal analysis in the medical imaging domain.
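The abstract refers to the modality gap, alignment, and uniformity without defining them. The sketch below shows one common way such quantities are computed on L2-normalized embeddings; it is an illustration under stated assumptions, not the authors' evaluation code. It assumes the modality gap is measured as the distance between the image and text embedding centroids, alignment as the mean squared distance between paired embeddings, and uniformity as the log of the mean Gaussian potential over all pairs with temperature t = 2. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def modality_gap(img, txt):
    """Assumed definition: Euclidean distance between modality centroids."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def alignment(img, txt):
    """Mean squared distance between paired embeddings (lower is better)."""
    return float(np.mean(np.sum((img - txt) ** 2, axis=1)))


def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower is better)."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(x.shape[0], k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


if __name__ == "__main__":
    # Synthetic stand-ins for image and text embeddings (not real data).
    rng = np.random.default_rng(0)
    img = l2_normalize(rng.normal(size=(64, 16)))
    txt = l2_normalize(img + 0.1 * rng.normal(size=(64, 16)))
    print("modality gap:", modality_gap(img, txt))
    print("alignment:   ", alignment(img, txt))
    print("uniformity:  ", uniformity(img))
```

Under these definitions, a smaller modality gap and lower alignment indicate that paired images and reports sit closer together, while lower uniformity indicates embeddings spread more evenly over the hypersphere. The pairwise-distance step is quadratic in the number of samples, so a real evaluation set would likely need batching.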
