Vision-Language Models Do Not Understand Negation

Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi

TL;DR

The paper explores a data-centric approach in which CLIP models are finetuned on large-scale synthetic datasets containing millions of negated captions. This can yield a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.

Abstract

Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.
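
To make "performing at chance level" concrete, the following is a minimal sketch of how a multiple-choice task with negated captions could be scored from image-caption similarity scores. The function names, the four-option setup, and the toy scores are illustrative assumptions, not the benchmark's actual code or data.

```python
def mcq_accuracy(scores, answers):
    """Fraction of questions where the highest-scoring caption is the correct one.

    scores[i][j] is an (assumed) similarity between image i and caption option j;
    answers[i] is the index of the correct option for image i.
    """
    if len(scores) != len(answers):
        raise ValueError("scores and answers must have the same length")
    correct = 0
    for row, answer in zip(scores, answers):
        # The model's choice is the option with the largest similarity.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == answer)
    return correct / len(answers)


def chance_level(num_options):
    """Expected accuracy of uniform random guessing among num_options choices."""
    return 1.0 / num_options


# Toy example: three questions with four caption options each (made-up numbers).
toy_scores = [
    [0.31, 0.12, 0.05, 0.20],  # picks option 0, correct
    [0.10, 0.40, 0.22, 0.15],  # picks option 1, but option 2 is correct
    [0.25, 0.27, 0.26, 0.11],  # picks option 1, correct
]
toy_answers = [0, 2, 1]

accuracy = mcq_accuracy(toy_scores, toy_answers)
baseline = chance_level(4)
```

A model whose accuracy stays close to `baseline` on negated captions is effectively guessing, which is the failure mode the abstract describes.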

Paper Structure

This paper contains 24 sections, 2 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: We present NegBench with image retrieval and multiple-choice tasks to evaluate negation understanding. CLIP-based models frequently misinterpret negation in both tasks, but we show how a synthetic data approach can improve performance.
  • Figure 2: General Pipeline for Constructing NegBench. We start by extracting positive concepts from vision datasets. An LLM proposes negative concepts, which are verified with an object detector for datasets without explicit object annotations. We use templates to generate captions with negation, then paraphrase them with an LLM to ensure linguistic variety and robust evaluation of negation understanding.
  • Figure 3: Performance drop in recall@5 on (a) COCO and (b) HardNeg-Syn text-to-image retrieval with negated captions (green stars) compared to original captions (orange circles). All models show substantial drops in performance, with NegCLIP experiencing the largest drop of 23.0% on HardNeg-Syn, which features hard negatives requiring stronger negation reasoning.
  • Figure 4: MCQ-Neg performance across model families. (a) CLIP-based models perform near random guessing (shown as a red dashed line), revealing their poor ability to handle negation. (b) Increasing model size (ViT-B$\rightarrow$L$\rightarrow$H) and using more advanced joint-embedding models (SigLIP, AIMV2) do not lead to better negation understanding, despite strong performance on other VLM tasks. (c) Medical VLMs experience large performance drops on negation MCQs, highlighting the risks of affirmation bias in high-stakes applications.
  • Figure 5: Performance by MCQ type: Affirmation, Negation, and Hybrid. CLIP-like models exhibit strong affirmation bias: they perform well on Affirmation MCQs (left panel) but fail on Negation MCQs (middle panel), often scoring well below random chance.
  • ...and 9 more figures
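
The recall@5 drop reported in Figure 3 can be illustrated with a short sketch. It assumes one common definition of recall@k for text-to-image retrieval: a query counts as a hit if at least one relevant image appears among its top-k ranked images. The function name and the toy scores are hypothetical and are not taken from the paper.

```python
def recall_at_k(scores, relevant, k=5):
    """Fraction of queries with at least one relevant image in the top-k results.

    scores[q][i] is an (assumed) similarity between caption q and image i;
    relevant[q] is the set of image indices that match caption q.
    """
    hits = 0
    for row, rel in zip(scores, relevant):
        # Rank image indices from most to least similar to the caption.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if rel & set(ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy example with two captions and three images (made-up numbers).
relevant_images = [{0}, {1}]
original_scores = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.1]]
# After adding a negated clause, the first caption now ranks image 1 highest.
negated_scores = [[0.4, 0.6, 0.3], [0.2, 0.8, 0.1]]

recall_original = recall_at_k(original_scores, relevant_images, k=1)
recall_negated = recall_at_k(negated_scores, relevant_images, k=1)
drop = recall_original - recall_negated
```

Comparing the same metric on original and negated versions of each caption isolates the effect of negation from overall retrieval quality.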