Enhancing zero-shot learning in medical imaging: integrating CLIP with advanced techniques for improved chest X-ray analysis

Prakhar Bhardwaj, Sheethal Bhat, Andreas Maier

TL;DR

MoCoCLIP addresses the scarcity of labeled data for zero-shot chest X-ray classification by fusing Momentum Contrast (MoCo) with CLIP to learn robust image representations aligned with radiology-text prompts. It introduces a momentum encoder and a large queue of negatives, which enable effective contrastive learning at practical batch sizes and mitigate class-imbalance effects. On NIH CXR14, MoCoCLIP achieves a ~6.5% relative improvement over CheXZero, and on CheXpert it reaches an average AUC of 0.750 vs 0.746 for CheXZero, indicating improved generalization to unseen pathologies. Ablation studies highlight the effectiveness of combining MoCo with the image-text contrastive loss. They also show that synthetic reports and pathology-specific variability still limit performance, which motivates future work with real radiology reports.
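
The momentum encoder and negative queue follow the general MoCo recipe: the key encoder's weights are an exponential moving average of the query encoder's weights, and keys from earlier batches are kept in a fixed-size FIFO queue that supplies extra negatives. The sketch below shows that recipe on plain Python lists. It assumes L2-normalized embeddings, a momentum coefficient m = 0.999 and a temperature of 0.07 (common MoCo defaults, not values confirmed for MoCoCLIP). The helpers momentum_update, NegativeQueue and info_nce are hypothetical and are not the authors' code.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """EMA update of the key (momentum) encoder: theta_k <- m*theta_k + (1-m)*theta_q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(query, positive_key, negatives, temperature=0.07):
    """InfoNCE loss: cross-entropy with the positive key as the target (index 0)
    and every queued key as a negative. Inputs are assumed to be unit-norm."""
    logits = [sum(a * b for a, b in zip(query, positive_key)) / temperature]
    logits += [sum(a * b for a, b in zip(query, n)) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[0]


class NegativeQueue:
    """Fixed-size FIFO of past key embeddings; the oldest keys are evicted first."""

    def __init__(self, max_size):
        self._keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        self._keys.extend(batch_keys)

    def negatives(self):
        return list(self._keys)


# One toy step: score against the current queue, then enqueue the new key.
queue = NegativeQueue(max_size=4)
queue.enqueue([normalize([0.0, 1.0]), normalize([-1.0, 0.5])])
q = normalize([1.0, 0.2])  # query-encoder output (hypothetical)
k = normalize([1.0, 0.1])  # momentum-encoder output for the same image (hypothetical)
loss = info_nce(q, k, queue.negatives())
queue.enqueue([k])
key_weights = momentum_update([0.5, -0.2], [0.4, 0.1], m=0.999)
```

Because the queue is decoupled from the batch, the number of negatives is set by max_size rather than by GPU memory. This is the property the TL;DR credits for making contrastive learning work at practical batch sizes.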

Abstract

Due to the large volume of medical imaging data, advanced AI methodologies are needed to assist radiologists in diagnosing thoracic diseases from chest X-rays (CXRs). Existing deep learning models often require large, labeled datasets, which are scarce in medical imaging due to the time-consuming and expert-driven annotation process. In this paper, we extend an existing approach for zero-shot learning in medical imaging by integrating Contrastive Language-Image Pre-training (CLIP) with Momentum Contrast (MoCo), resulting in our proposed model, MoCoCLIP. Our method addresses challenges posed by class-imbalanced and unlabeled datasets, enabling improved detection of pulmonary pathologies. Experimental results on the NIH ChestXray14 dataset demonstrate that MoCoCLIP outperforms the state-of-the-art CheXZero model, achieving a relative improvement of approximately 6.5%. Furthermore, on the CheXpert dataset, MoCoCLIP demonstrates superior zero-shot performance, achieving an average AUC of 0.750 compared to 0.746 for CheXZero, highlighting its enhanced generalization capabilities on unseen data.
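
Zero-shot evaluation of a CLIP-style model compares an image embedding against text-prompt embeddings instead of training a classifier head. The sketch below assumes a CheXZero-style scheme with one positive prompt (e.g. "pneumonia") and one negative prompt (e.g. "no pneumonia") per pathology, turned into a probability by a two-way softmax. The prompt wording, the toy 3-dimensional embeddings and the function names are hypothetical, and the exact prompts used by MoCoCLIP are not given here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probability(image_emb, pos_prompt_emb, neg_prompt_emb, temperature=1.0):
    """P(pathology present) from a softmax over positive/negative prompt similarities.

    The two-way softmax exp(s_pos) / (exp(s_pos) + exp(s_neg)) equals
    1 / (1 + exp(s_neg - s_pos)), which is what is computed here.
    """
    s_pos = cosine(image_emb, pos_prompt_emb) / temperature
    s_neg = cosine(image_emb, neg_prompt_emb) / temperature
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


# Hypothetical text-encoder outputs per pathology: (positive prompt, negative prompt).
prompt_embeddings = {
    "Pneumonia": ([0.9, 0.1, 0.0], [0.1, 0.9, 0.0]),
    "Effusion": ([0.0, 0.2, 0.9], [0.2, 0.1, 0.1]),
}
image_embedding = [0.8, 0.2, 0.1]  # hypothetical image-encoder output for one CXR

scores = {
    label: zero_shot_probability(image_embedding, pos, neg)
    for label, (pos, neg) in prompt_embeddings.items()
}
```

AUC depends only on how these scores rank studies with and without a pathology. The per-pathology probabilities can therefore be passed straight to a ROC-AUC routine, and those per-pathology AUCs can be averaged into a figure comparable to the average AUC reported above.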

Paper Structure

This paper contains 16 sections, 1 equation, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Training pipeline with MoCo integration into the baseline CLIP.
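
The baseline CLIP component of this pipeline is trained with an image-text contrastive loss. Under this loss, each image should score highest against its own report and lower against every other report in the batch, and vice versa. The sketch below gives the standard symmetric form under stated assumptions: L2-normalized embeddings and a fixed temperature of 0.07. How MoCoCLIP weights this term against the MoCo term is not stated here, and clip_loss is a hypothetical helper name.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
    return [z - log_sum_exp for z in row]


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss over one batch.

    Pair i is (image_embs[i], text_embs[i]); every other text (or image)
    in the batch acts as a negative. Embeddings are assumed unit-norm.
    """
    n = len(image_embs)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
              for img in image_embs]
    # Image -> text: row i should put its probability mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: column j should put its probability mass on row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2.0


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]  # each report matches its image
swapped_reports = [[0.0, 1.0], [1.0, 0.0]]  # reports paired with the wrong image
aligned_loss = clip_loss(images, aligned_reports)
swapped_loss = clip_loss(images, swapped_reports)
```

This sketch draws negatives only from the current batch. The momentum encoder and negative queue described in the TL;DR are what let MoCoCLIP use far more negatives than a single batch provides.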