Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis

Hanbin Ko, Chang-Min Park

TL;DR

The paper adapts CLIP-based vision-language pretraining to medical data by addressing negation and data imbalance through clinically-enhanced dynamic soft labels, negation-based hard negatives, and graph embeddings. It introduces CXR-Align, a benchmark for assessing negation handling and CXR-report alignment, and demonstrates state-of-the-art performance across zero-shot, fine-tuned classification, and report retrieval tasks. The approach shows robust improvements by integrating textual, clinical, and graphical signals, and provides empirical insights from extensive ablations and analyses. This work advances clinical language understanding in medical imaging and offers practical guidance for deploying medical VLP models in real-world settings.

Abstract

The development of large-scale image-text pair datasets has significantly advanced self-supervised learning in Vision-Language Processing (VLP). However, directly applying general-domain architectures such as CLIP to medical data presents challenges, particularly in handling negations and addressing the inherent data imbalance of medical datasets. To address these issues, we propose a novel approach that integrates clinically-enhanced dynamic soft labels and medical graphical alignment, thereby improving clinical comprehension and the applicability of contrastive loss in medical contexts. Furthermore, we introduce negation-based hard negatives to deepen the model's understanding of the complexities of clinical language. Our approach is easily integrated into the medical CLIP training pipeline and achieves state-of-the-art performance across multiple tasks, including zero-shot, fine-tuned classification, and report retrieval. To comprehensively evaluate our model's capacity for understanding clinical language, we introduce CXR-Align, a benchmark uniquely designed to evaluate the understanding of negation and clinical information within chest X-ray (CXR) datasets. Experimental results demonstrate that our proposed methods are straightforward to implement and generalize effectively across contrastive learning frameworks, enhancing medical VLP capabilities and advancing clinical language understanding in medical imaging.
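As a rough illustration of how soft labels can replace the one-hot targets of a contrastive loss, the sketch below follows the description in the Figure 3 caption: an intra-modal self-similarity matrix (from clinical labels, text embeddings, or graph embeddings) is turned into soft targets, and a KL-divergence term takes the place of InfoNCE. The function names, the temperatures `tau` and `tau_soft`, and the symmetric averaging over both retrieval directions are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Row-wise L2 normalization; eps guards against division by zero.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def kl_rows(log_target, log_pred):
    # Mean over rows of KL(target || pred), both given as log-probabilities.
    return np.mean(np.sum(np.exp(log_target) * (log_target - log_pred), axis=-1))

def soft_label_kl_loss(img_emb, txt_emb, source_feats, tau=0.07, tau_soft=0.1):
    """Contrastive loss with soft targets (hypothetical sketch).

    img_emb, txt_emb: (B, D) image and report embeddings for one batch.
    source_feats:     (B, K) features whose self-similarity defines the soft
                      labels, e.g. clinical label vectors or text embeddings.
    """
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img @ txt.T / tau                    # (B, B) cross-modal similarities

    src = l2_normalize(source_feats)
    log_target = log_softmax(src @ src.T / tau_soft)  # soft labels per row

    # Average the image-to-text and text-to-image directions.
    i2t = kl_rows(log_target, log_softmax(logits))
    t2i = kl_rows(log_target, log_softmax(logits.T))
    return 0.5 * (i2t + t2i)

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8))
reports = rng.normal(size=(4, 8))
labels = rng.normal(size=(4, 5))
print(soft_label_kl_loss(images, reports, labels))
```

With one-hot targets this objective reduces to the usual cross-entropy form of InfoNCE; the soft targets instead give partial credit to off-diagonal pairs whose reports are clinically similar, which is how the imbalance issue described above is meant to be mitigated.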

Paper Structure

This paper contains 49 sections, 9 equations, 13 figures, and 13 tables.

Figures (13)

  • Figure 1: (a) Standard visual-language pre-training approaches using contrastive learning (e.g., InfoNCE). (b) Our approach, leveraging unique medical domain characteristics (e.g., imbalance and negations), dynamically generates soft labels based on clinical, textual, and relational similarities while integrating negations as hard negatives.
  • Figure 2: Given a CXR report, CheXbert identifies all positive entities, and one is randomly selected. A language model then (i) splits the report so each sentence contains a single clinical entity without temporal statements and (ii) removes sentences related to the selected entity. Finally, a negation for the selected entity is added at a random position within the report (beginning, middle, or end).
  • Figure 3: Overview of the proposed pipeline. Hard negative reports are created that differ from the original by only one clinical entity. Embeddings of each modality (CXR, report, graph) are extracted by their encoders, along with clinical labels from the report. Intra-modal self-similarities are computed for clinical labels, text embeddings, and graph embeddings, used as soft labels for each stream. The conventional InfoNCE loss is replaced by KL-Divergence when incorporating softened targets, ensuring labels reflect the textual, clinical, and graphical meanings correctly.
  • Figure 4: Counts of clinical entities in the whole MIMIC training set and a private dataset collected from a tertiary hospital. The private dataset comprises around 1.3 million records collected over 20 years, each from unique patients.
  • Figure 5: Counts of clinical entities in reports for the MIMIC training set.
  • ...and 8 more figures
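
The hard-negative construction described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the CheXbert labeling and the language-model sentence splitting have already been done upstream, so the input is a list of (sentence, entity) pairs with one clinical entity per sentence. The function name and the negation template "No <entity>." are hypothetical.

```python
import random

def make_negation_hard_negative(sentences, positive_entities, rng=None):
    """Build a hard-negative report that differs by one clinical entity.

    sentences:         list of (text, entity) pairs, one entity per sentence
                       (assumed output of an upstream splitting step).
    positive_entities: entities flagged positive in the report
                       (assumed output of an upstream labeler).
    """
    rng = rng or random.Random()

    # Randomly pick one positive entity to negate.
    target = rng.choice(sorted(positive_entities))

    # Drop every sentence that mentions the selected entity.
    kept = [text for text, entity in sentences if entity != target]

    # Insert a negation of that entity at the beginning, middle, or end.
    negation = f"No {target.lower()}."  # hypothetical template
    position = rng.choice(["beginning", "middle", "end"])
    index = {"beginning": 0, "middle": len(kept) // 2, "end": len(kept)}[position]
    kept.insert(index, negation)

    return " ".join(kept), target

example = [
    ("The heart is enlarged.", "Cardiomegaly"),
    ("Small left pleural effusion.", "Pleural Effusion"),
    ("Lines and tubes are in place.", "Support Devices"),
]
report, negated = make_negation_hard_negative(
    example, {"Cardiomegaly", "Pleural Effusion"}, rng=random.Random(0)
)
print(negated, "->", report)
```

Because the resulting report keeps every other sentence intact, it differs from the original in exactly one clinical entity, which is what makes it a hard negative for the paired image.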