Vision-Language Models Do Not Understand Negation

Kumail Alhamoud; Shaden Alshammari; Yonglong Tian; Guohao Li; Philip Torr; Yoon Kim; Marzyeh Ghassemi

TL;DR

The paper explores a data-centric approach in which CLIP models are finetuned on large-scale synthetic datasets containing millions of negated captions. This can yield a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.

Abstract

Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.
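The retrieval results above are reported as recall on negated queries. As a rough illustration of how such a metric is commonly computed, here is a minimal sketch of recall@k over a query-by-image score matrix. The function name `recall_at_k`, the assumption of one correct image per query, and the toy scores are all illustrative assumptions, not the paper's evaluation code or data.

```python
def recall_at_k(similarity, ground_truth, k=5):
    """Fraction of queries whose correct image is among the top-k scored images.

    similarity[i][j] is the score of query i against image j;
    ground_truth[i] is the index of the correct image for query i.
    (Assumes exactly one correct image per query, which is a simplification.)
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank image indices by descending score and keep the first k.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        hits += target in top_k
    return hits / len(ground_truth)


# Toy scores: 3 queries x 4 images (made-up numbers, not results from the paper).
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked first
    [0.2, 0.4, 0.8, 0.6],  # correct image 1 is ranked third
    [0.1, 0.2, 0.3, 0.7],  # correct image 3 is ranked first
]
ground_truth = [0, 1, 3]

r1 = recall_at_k(similarity, ground_truth, k=1)
r3 = recall_at_k(similarity, ground_truth, k=3)
```

In this toy case two of three queries succeed at k=1 and all three succeed at k=3. In a real evaluation the scores would typically come from a VLM's text and image embeddings.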

Paper Structure (24 sections, 2 equations, 14 figures, 3 tables)

This paper contains 24 sections, 2 equations, 14 figures, 3 tables.

Introduction
Related Work
The Negation Benchmark (NegBench)
Transforming Datasets for Negation Evaluation
General Dataset Transformation Overview.
Applicability Across Data Types and Domains
Synthetic Datasets for Controlled Evaluation
NegBench Evaluations: Results and Insights
Why Do VLMs Not Understand Negation?
A Data-Centric Approach for Improving Negation Understanding
Synthesizing a Fine-Tuning Negation Dataset
Fine-Tuning with Negation-Enriched Data
Discussion and Conclusions
Evaluating LLaVA on NegBench MCQs
LLaVA, an instruction-tuned VLM, demonstrates improvement.
...and 9 more sections

Figures (14)

Figure 1: We present NegBench with image retrieval and multiple-choice tasks to evaluate negation understanding. CLIP-based models frequently misinterpret negation in both tasks, but we show how a synthetic data approach can improve performance.
Figure 2: General Pipeline for Constructing NegBench. We start by extracting positive concepts from vision datasets. An LLM proposes negative concepts, which are verified with an object detector for datasets without explicit object annotations. We use templates to generate captions with negation, then paraphrase them by an LLM to ensure linguistic variety and robust evaluation of negation understanding.
Figure 3: Performance drop in recall@5 on (a) COCO and (b) HardNeg-Syn text-to-image retrieval with negated captions (green stars) compared to original captions (orange circles). All models show substantial drops in performance, with NegCLIP experiencing the largest drop of 23.0% on HardNeg-Syn, which features hard negatives requiring stronger negation reasoning.
Figure 4: MCQ-Neg performance across model families. (a) CLIP-based models perform near random guessing (shown as a red dashed line), revealing their poor ability to handle negation. (b) Increasing model size (ViT-B$\rightarrow$L$\rightarrow$H) and using more advanced joint-embedding models (SigLIP, AIMV2) do not lead to better negation understanding, despite strong performance on other VLM tasks. (c) Medical VLMs experience large performance drops on negation MCQs, highlighting the risks of affirmation bias in high-stakes applications.
Figure 5: Performance by MCQ type: Affirmation, Negation, and Hybrid. CLIP-like models exhibit strong affirmation bias—they perform well on Affirmation MCQs (left panel), but fail on Negation MCQs (middle panel), often performing much below random chance.
...and 9 more figures
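The template step described in the Figure 2 caption (combining a positive concept with a proposed negative concept into a caption with negation) can be sketched roughly as follows. The templates, the `negated_caption` helper, and the example concepts are hypothetical; the actual templates and the LLM paraphrasing step are not given in this summary.

```python
import random

# Hypothetical templates; the paper's actual templates are not listed here.
TEMPLATES = [
    "A photo that contains {pos} but not {neg}.",
    "An image with {pos} and no {neg}.",
    "There is no {neg} in this image, which shows {pos}.",
]


def negated_caption(positive, negative, rng):
    """Fill a randomly chosen template with a present and an absent concept."""
    template = rng.choice(TEMPLATES)
    return template.format(pos=positive, neg=negative)


# Seeded generator so the chosen template is reproducible across runs.
rng = random.Random(0)
caption = negated_caption("a dog", "a cat", rng)
```

In the pipeline the caption describes, captions like this would then be paraphrased by an LLM for linguistic variety; that step is omitted from this sketch.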
