Enhancing zero-shot learning in medical imaging: integrating CLIP with advanced techniques for improved chest X-ray analysis
Prakhar Bhardwaj, Sheethal Bhat, Andreas Maier
TL;DR
MoCoCLIP addresses the scarcity of labeled data in chest X-ray zero-shot learning by fusing Momentum Contrast (MoCo) with CLIP to learn robust image representations aligned with radiology-text prompts. It introduces a momentum encoder and a large queue of negative samples, which enables effective contrastive learning at practical batch sizes and mitigates class-imbalance effects. On NIH ChestXray14, MoCoCLIP achieves a relative improvement of approximately 6.5% over CheXZero; on CheXpert, it reaches an average AUC of 0.750 versus 0.746 for CheXZero, indicating improved generalization to unseen pathologies. Ablation studies highlight the effectiveness of combining MoCo with the image-text contrastive loss, while noting that synthetic reports and pathology-specific variability still limit peak performance, which motivates future work with real radiology reports.
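The two MoCo ingredients named in the summary, a momentum-updated key encoder and a queue of negatives, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D embeddings, the queue size, and the defaults `m=0.999` and `temperature=0.07` (the values used in the original MoCo paper) are assumptions chosen for illustration, and the summary does not say which modality supplies the queries and keys in MoCoCLIP.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key encoder: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_queue, temperature=0.07):
    """Cross-entropy over [positive, negatives], with the positive at index 0."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negative_queue]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_sum_exp - logits[0]


# FIFO queue of past key embeddings (hypothetical size); once full,
# appending a new key silently drops the oldest one.
queue = deque(maxlen=4)
for neg in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(neg)

# A query that matches its positive key yields a much lower loss than one
# whose "positive" is no closer to it than the queued negatives.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], queue)
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], queue)
```

Because negatives come from the queue rather than from the current batch, the number of negatives is decoupled from the batch size, which is what makes contrastive training feasible at practical batch sizes. The slow momentum update keeps the queued keys consistent with one another even though they were encoded at different training steps.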
Abstract
Due to the large volume of medical imaging data, advanced AI methodologies are needed to assist radiologists in diagnosing thoracic diseases from chest X-rays (CXRs). Existing deep learning models often require large labeled datasets, which are scarce in medical imaging because annotation is time-consuming and requires expert input. In this paper, we extend the existing approach to zero-shot learning in medical imaging by integrating Contrastive Language-Image Pre-training (CLIP) with Momentum Contrast (MoCo), resulting in our proposed model, MoCoCLIP. Our method addresses the challenges posed by class-imbalanced and unlabeled datasets, enabling improved detection of pulmonary pathologies. Experimental results on the NIH ChestXray14 dataset demonstrate that MoCoCLIP outperforms the state-of-the-art CheXZero model, achieving a relative improvement of approximately 6.5%. Furthermore, on the CheXpert dataset, MoCoCLIP demonstrates superior zero-shot performance, achieving an average AUC of 0.750 compared with 0.746 for CheXZero, highlighting its enhanced generalization capabilities on unseen data.
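To put the two reported results on the same scale, the CheXpert AUCs can be converted to a relative improvement. The helper below is a hypothetical illustration. It assumes the 6.5% figure for NIH ChestXray14 is computed the same way, as a ratio of the gain to the baseline score, which the abstract does not state explicitly.

```python
def relative_improvement(new, baseline):
    """Gain over the baseline, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


# Average AUCs on CheXpert as reported in the abstract.
chexpert_gain = relative_improvement(0.750, 0.746)
print(f"{chexpert_gain:.2%}")  # prints 0.54%
```

Under that assumption, the CheXpert gain is roughly an order of magnitude smaller than the gain reported on NIH ChestXray14.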
