
The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining

Jasmine Vu, Shivanand Sheshappanavar

TL;DR

The paper tackles the critical issue that CLIP-based models struggle with negation in medical prompts, which can lead to erroneous image retrieval in chest X-ray tasks. It proposes two fine-tuning strategies: CON1 CLIP using standard in-batch contrastive learning and CON2 CLIP employing a CoN-CLIP-inspired objective with semantic oppositions and distractors. Results show CON1 offers modest gains in negation sensitivity and a slight drop in positive-prompt accuracy, while CON2 delivers larger improvements (≈15% in negation-related retrieval) with more robust internal representations, evidenced by token attribution and embedding analyses. These gains suggest that negation-aware fine-tuning can enhance the reliability of vision-language systems in clinical workflows, though generalization limits and safety implications require further study.

Abstract

Large vision-language models like CLIP are increasingly used in medical imaging tasks because they can align images and text without extensive labeled data. This makes them particularly useful for applications such as image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model, improving its retrieval accuracy with fine-tuning methods outlined in previous work. Results show improved handling of negation in the CLIP model, with a slight decrease in accuracy on positive prompts. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and to improve its handling of negation in clinically relevant language, making it more reliable in medical AI devices.
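The retrieval evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that prompts and images have already been embedded into a shared space and that retrieval ranks images by cosine similarity, as is typical for CLIP-style models. The function names, the toy embeddings, and the use of precision at k are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def retrieve_top_k(text_emb, image_embs, k=5):
    """Return indices and scores of the k images most similar to the prompt."""
    sims = l2_normalize(image_embs) @ l2_normalize(text_emb)
    order = np.argsort(-sims)[:k]
    return order, sims[order]

def precision_at_k(retrieved, relevant_mask):
    """Fraction of retrieved images that are relevant to the prompt."""
    return float(np.mean(relevant_mask[retrieved]))

# Toy data (hypothetical): images 0 and 2 show the finding, image 1 does not.
image_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
has_finding = np.array([True, False, True])

# Hypothetical failure case: the negated prompt embeds almost exactly where
# the affirmative prompt does, so both retrieve the same images.
affirmative_emb = np.array([1.0, 0.0])
negated_emb = np.array([0.98, 0.02])

top_pos, _ = retrieve_top_k(affirmative_emb, image_embs, k=2)
top_neg, _ = retrieve_top_k(negated_emb, image_embs, k=2)

pos_precision = precision_at_k(top_pos, has_finding)    # affirmative prompt
neg_precision = precision_at_k(top_neg, ~has_finding)   # negated prompt
```

In this toy setup the negated prompt is scored against the complement of the relevance mask, so a model that ignores the negation word does well on the affirmative prompt and poorly on the negated one.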

Paper Structure

This paper contains 20 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Flowchart of Stanford AIMI CheXagent chest X-ray image retrieval given a prompt containing affirmation for pleural effusion and a prompt containing negation for pleural effusion. All chest X-ray images shown in figures were obtained from the Open-i repository and originate from the PubMed Central Open Access subset or the Indiana University Chest X-ray dataset. All images are publicly available, fully de-identified, and used in accordance with their respective Creative Commons or open-access licenses [Authors14e].
  • Figure 2: Overview of our contrastive setup, adapted from a diagram in a Medium article on InfoNCE loss [medium_infonce]. All chest X-ray images shown in figures were obtained from the Open-i repository and originate from the PubMed Central Open Access subset or the Indiana University Chest X-ray dataset [Authors14e].
  • Figure 3: Use of negations and distractor images in a contrastive objective for fine-tuning the CON2 CLIP text encoder for negation understanding, adapted from Singh et al. [Authors14d]. All chest X-ray images shown in figures were obtained from the Open-i repository and originate from the PubMed Central Open Access subset or the Indiana University Chest X-ray dataset [Authors14e].
  • Figure 4: Token attribution of CheXagent, CON1 CLIP, and CON2 CLIP given a prompt containing negation for pleural effusion.
  • Figure 5: t-SNE graph of CheXagent (top), CON1 CLIP (middle), and CON2 CLIP (bottom) depicting Structured Positive, Structured Negative, Natural Positive, and Natural Negative prompt clusters.
  • ...and 2 more figures
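
The in-batch contrastive setup named in the TL;DR and in the Figure 2 caption is commonly written as a symmetric InfoNCE loss. The sketch below shows that standard form in NumPy, under the assumption that each image in a batch is paired with the text at the same index. The temperature of 0.07 and the function names are assumptions for illustration, not values from the paper. The negated captions and distractor images used for CON2 (Figure 3) are not modeled here.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def in_batch_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i is the positive, all other pairs are negatives."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # shape (batch, batch)
    diag = np.arange(logits.shape[0])
    # Image-to-text: each row is a classification over the texts in the batch.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column is a classification over the images.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch (hypothetical): aligned pairs versus deliberately mismatched pairs.
images = np.eye(4)
aligned_loss = in_batch_contrastive_loss(images, np.eye(4))
mismatched_loss = in_batch_contrastive_loss(images, np.roll(np.eye(4), 1, axis=0))
```

With correctly paired embeddings the loss is close to zero, and with mismatched pairs it is large. That gap is the signal a fine-tuning run of this kind would minimize.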