NegVQA: Can Vision Language Models Understand Negation?

Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy

TL;DR

NegVQA targets negation understanding in vision-language models (VLMs) with a curated VQA benchmark of 7,379 negated two-choice questions drawn from diverse domains. GPT-4o is used to write fluent negations of existing questions, and the answer choices are inverted so that answering correctly requires actually processing the negation. Twenty VLMs from seven families are evaluated zero-shot. The results show a pervasive struggle with negation, a U-shaped scaling trend in which performance first declines with model size and then improves, and a substantial gap relative to human performance. The benchmark serves as a diagnostic resource and points to concrete directions for improving negation handling in VLMs.

Abstract

Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
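The construction and scoring described above can be illustrated with a small sketch. It is not the authors' pipeline: the `TwoChoiceItem` class, the `negate_item` and `accuracy` helpers, and the example question are all hypothetical, and the negated question text is supplied by hand here rather than generated by a language model. The sketch also assumes that "inverting" a two-choice item means the other choice becomes the correct one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoChoiceItem:
    """A two-choice VQA item (hypothetical schema, image omitted for brevity)."""
    question: str
    choices: tuple[str, str]
    answer_index: int  # index of the correct choice: 0 or 1


def negate_item(item: TwoChoiceItem, negated_question: str) -> TwoChoiceItem:
    """Pair an already-negated question with the same choices.

    Assumption: negating the question makes the other choice correct,
    so the answer index is flipped.
    """
    return TwoChoiceItem(negated_question, item.choices, 1 - item.answer_index)


def accuracy(items: list[TwoChoiceItem], predictions: list[int]) -> float:
    """Fraction of items whose predicted choice index matches the answer."""
    if not items or len(items) != len(predictions):
        raise ValueError("items and predictions must be non-empty and aligned")
    correct = sum(pred == item.answer_index
                  for item, pred in zip(items, predictions))
    return correct / len(items)


# Hypothetical example item and a hand-written negation of it.
original = TwoChoiceItem("Is the cat on the sofa?", ("yes", "no"), 0)
negated = negate_item(original, "Is the cat not on the sofa?")

# A model that ignores the negation answers "yes" (index 0) both times:
# it is right on the original question and wrong on the negated one.
ignores_negation = [0, 0]
score = accuracy([original, negated], ignores_negation)
```

A model that ignores the word "not" keeps its original answer, so it scores well on the original questions and poorly on their negated counterparts, which is the kind of performance drop the benchmark is designed to expose.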

Paper Structure

This paper contains 10 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: NegVQA dataset overview. (Middle) NegVQA comprises a diverse set of negated questions, totaling 7,379 instances sourced from various VQA datasets and domains (general, document/chart, reasoning, OCR). (Left/Right) Example questions from different datasets and domains, with correct answers highlighted in green.
  • Figure 2: Detailed prompts for adding the negation using GPT-4o.
  • Figure 3: Model performance and scaling analysis on NegVQA across different VLM families and task categories. (Top left) Performance on the original non-negated two-choice questions shows high accuracy and a positive scaling trend. (Top right) Performance on NegVQA (negated two-choice questions) is significantly lower, with models exhibiting a U-shaped scaling pattern: accuracy initially decreases and then improves as model size increases. (Bottom) Category-wise breakdown of NegVQA performance (reasoning, document/chart, general), where the U-shaped scaling effect is more pronounced in the reasoning and document/chart categories.
  • Figure 4: Errors in negated questions generated by GPT-4o. The first question cannot be negated. In the second and third questions, the negation is applied to the condition when it should apply to the main question.
  • Figure 5: Model performance and scaling analysis on NegVQA across different VLM families and datasets. For each of the 20 subsets in NegVQA, we present scaling curves for the original non-negated dataset (left) and the negated dataset (right), resulting in a total of 40 plots.