Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations

Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, Aparna Bharati

TL;DR

Negation understanding is a critical but underexplored capability in vision-language models. The authors introduce CC-Neg, a large-scale benchmark, and CoN-CLIP, a targeted fine-tuning framework that leverages negated captions and distractor images to disentangle negation semantics from visual content. Their approach yields consistent improvements in zero-shot image classification across eight datasets (average +3.85% top-1) and substantial gains on challenging compositional benchmarks (SugarCREPE +4.4%), demonstrating emergent compositional understanding. By providing a scalable, efficient data-driven approach, this work enhances semantic alignment between images and negation-aware text, with practical impact on robust multimodal reasoning.

Abstract

Existing vision-language models (VLMs) treat text descriptions as a unit, confusing individual concepts in a prompt and impairing visual semantic matching and reasoning. Negation is an important aspect of reasoning in logic and language. This paper highlights the limitations of popular VLMs such as CLIP in understanding the implications of negations, i.e., the effect of the word "not" in a given prompt. To enable evaluation of VLMs on fluent prompts with negations, we present CC-Neg, a dataset containing 228,246 images, true captions, and their corresponding negated captions. Using CC-Neg along with modifications to the contrastive loss of CLIP, our proposed CoN-CLIP framework has an improved understanding of negations. This training paradigm improves CoN-CLIP's ability to encode semantics reliably, resulting in a 3.85% average gain in top-1 accuracy for zero-shot image classification across 8 datasets. Further, CoN-CLIP outperforms CLIP on challenging compositionality benchmarks such as SugarCREPE by 4.4%, showcasing emergent compositional understanding of objects, relations, and attributes in text. Overall, our work addresses a crucial limitation of VLMs by introducing a dataset and framework that strengthens semantic associations between images and text, demonstrating improved large-scale foundation models with significantly reduced computational cost, promoting efficiency and accessibility.

Paper Structure

This paper contains 34 sections, 1 equation, 8 figures, 16 tables.

Figures (8)

  • Figure 1: Vision-language models (VLMs) struggle to understand negations in text, observable in image-text matching and applications such as text-to-image generation (e.g., DALL-E 3, Midjourney).
  • Figure 2: VLMs such as CLIP often match images to negation-based distractors with higher similarities than their true captions (left). Further, CLIP accurately retrieves images of a class even when prompted with "this is not a photo of a {class}" (right).
  • Figure 3: Overview of the generation of negated captions. Given the true caption of an image, an LLM (i) decomposes it into a subject and predicate-object pairs, and then (ii) selects a random predicate-object pair to negate to finally write the negated prompt.
  • Figure 4: We report the accuracy of matching the image to its true caption for all VLMs, varying the number of predicate-objects, $\mathcal{K}$, from 1 to 5 (left). Additionally, we show the performance of all VLMs on each type of negation word used in CC-Neg (right).
  • Figure 5: We incorporate negations and distractor images in a contrastive objective for fine-tuning the CLIP text encoder towards improved negation understanding. The proposed loss functions are depicted above using the example of a training instance.
  • ...and 3 more figures
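
The matching accuracy mentioned in the Figure 2 and Figure 4 captions can be illustrated with a minimal sketch. It assumes that an image counts as correctly matched when its cosine similarity to the true caption is strictly higher than its similarity to the negated caption; the exact protocol used in the paper is not given in this summary. The function names and the toy two-dimensional embeddings below are hypothetical and stand in for the outputs of a real VLM's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negation_matching_accuracy(image_embs, true_embs, negated_embs):
    """Fraction of images scored closer to the true caption than to the negated one.

    Assumption: a strict comparison of cosine similarities decides a match.
    """
    correct = 0
    for img, true_cap, neg_cap in zip(image_embs, true_embs, negated_embs):
        if cosine(img, true_cap) > cosine(img, neg_cap):
            correct += 1
    return correct / len(image_embs)


# Toy embeddings (hypothetical values, not taken from CC-Neg).
images = [[1.0, 0.0], [0.0, 1.0]]
true_caps = [[0.9, 0.1], [0.6, 0.4]]
neg_caps = [[0.5, 0.5], [0.1, 0.9]]

# The first image prefers its true caption; the second prefers the negated one.
accuracy = negation_matching_accuracy(images, true_caps, neg_caps)
print(accuracy)
```

A model that ignores the word "not" would tend to score the true and negated captions almost identically, so this accuracy would stay near chance; that is the kind of failure Figure 2 describes.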