Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning

Zesheng Yang, Xi Jiang, Bingzhang Hu, Weili Guan, Runmin Cong, Guo-Jun Qi, Feng Zheng

Abstract

Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers. We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.

Paper Structure (37 sections, 9 equations, 6 figures, 13 tables, 1 algorithm)

Figures (6)

  • Figure 1: Intuition of positive and negative semantic referring. Besides object names and positive-semantic descriptions, humans can refer to objects with negative semantics. For instance, when unsure how to describe the color of a calico cat, one might say, "the cat not in black." Existing grounding models struggle to understand such referring expressions and may even localize the opposite object. With further fine-tuning, our method efficiently improves grounding for complex referring expressions.
  • Figure 2: (a) We use an advanced MLLM, GPT-4V, to generate six positive-semantic and six negative-semantic referring expressions for every selected image. (b) The proposed D-Negation dataset contains a total of 13,893 images across 80 categories. (c) The proposed D-Negation dataset contains 139,980 text annotations. What sets it apart from existing datasets is its significantly higher frequency of negation words and its greater tendency to use modifiers to describe objects.
  • Figure 3: Overview of the proposed fine-tuning framework with explicit semantic-opposition constraints. The framework is designed for the D-Negation dataset and is compatible with visual grounding models that incorporate a vision-language fusion module. Given an image and semantically opposed textual descriptions (e.g., positive vs. negative attributes), text embeddings interact with visual features in the fusion module and are decoded into grounding predictions. Beyond standard grounding losses, we introduce two complementary objectives. The TSO loss imposes a distance-based exclusion constraint in the text embedding space, explicitly pushing semantically opposed descriptions (e.g., "red" vs. "not red") apart. The PNC loss enforces a semantic exclusion constraint, ensuring that a visual region cannot be simultaneously aligned with both polarities of the same attribute. Together, these constraints address negation-induced grounding ambiguity by enforcing semantic opposition at both the linguistic and cross-modal fusion levels.
  • Figure 4: Qualitative Comparison of Model Outputs. Visualization of the APE model's outputs on representative examples after fine-tuning with our proposed method. For clarity, the prompt tokens within bounding boxes are displayed beneath each image.
  • Figure 5: Performance gain of single and combined attributes on the $D^3$ dataset. Each attribute enhances the model's performance, and combining multiple attributes leads to even greater improvements. As more attributes are added, the performance gain gradually decreases and eventually stabilizes.
  • ...and 1 more figures
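
The Figure 3 caption describes the two objectives only in words, and the paper's actual formulas are not reproduced here. As a rough illustration of the two ideas, the sketch below shows one possible reading: a hinge penalty on cosine distance for the text-space constraint, and a product of alignment probabilities for the exclusion constraint. The function names, the margin value, the use of cosine distance, and the sigmoid-product form are all assumptions made for illustration, not the authors' definitions of the TSO or PNC losses.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def text_opposition_hinge(pos_emb, neg_emb, margin=1.0):
    """Hypothetical text-space penalty (assumed form, not the paper's TSO loss).

    The penalty is zero once the embeddings of the opposed descriptions
    (e.g. "red" vs. "not red") are at least `margin` apart in cosine distance,
    and grows linearly as they move closer together.
    """
    return max(0.0, margin - cosine_distance(pos_emb, neg_emb))


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def polarity_exclusion_penalty(pos_logits, neg_logits):
    """Hypothetical exclusion penalty (assumed form, not the paper's PNC loss).

    For each region, multiply the probability of aligning with the positive
    description by the probability of aligning with the negative one.
    The product is large only when a region matches both polarities at once.
    Returns the mean over regions.
    """
    products = [sigmoid(p) * sigmoid(n) for p, n in zip(pos_logits, neg_logits)]
    return sum(products) / len(products)


if __name__ == "__main__":
    # Identical embeddings are maximally penalised; opposite ones are not.
    print(text_opposition_hinge([1.0, 0.0], [1.0, 0.0]))
    print(text_opposition_hinge([1.0, 0.0], [-1.0, 0.0]))
    # A region that matches both polarities is penalised more than one
    # that matches only the positive description.
    print(polarity_exclusion_penalty([4.0], [4.0]))
    print(polarity_exclusion_penalty([4.0], [-4.0]))
```

In this reading, the first penalty acts on text embeddings alone, while the second acts on per-region alignment scores produced after vision-language fusion, mirroring the caption's distinction between the linguistic and cross-modal levels.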