TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP

Yuliang Cai, Jesse Thomason, Mohammad Rostami

TL;DR

This work tackles the limited negation understanding of CLIP by introducing TNG-CLIP, a training-time negation data generation framework that creates diverse negation captions on the fly and adds only about 2.5% to training time. It also introduces Neg-TtoI, the first benchmark of negation-bearing prompts for text-to-image generation, enabling systematic evaluation of generation with negation semantics. Across image-to-text matching, text-to-image retrieval, and image generation tasks, TNG-CLIP achieves state-of-the-art results and demonstrates strong generalization, outperforming fixed-data and LLM-based approaches. By generating negation data dynamically and evaluating across diverse tasks, this work provides a practical pathway to more robust, negation-aware vision-language models and a benchmark for future research in negation-aware generation.

Abstract

Vision-language models (VLMs), such as CLIP, have demonstrated strong performance across a range of downstream tasks. However, CLIP is still limited in negation understanding: the ability to recognize the absence or exclusion of a concept. Existing methods address the problem by using a large language model (LLM) to generate large-scale datasets of image captions containing negation, which are then used to fine-tune CLIP. However, these methods are both time- and compute-intensive, and their evaluations are typically restricted to image-text matching tasks. To expand the horizon, we (1) introduce a training-time negation data generation pipeline in which negation captions are generated during the training stage, adding only 2.5% to training time, and (2) propose the first benchmark, Neg-TtoI, for evaluating text-to-image generation models on prompts containing negation, assessing a model's ability to produce semantically accurate images. We show that our proposed method, TNG-CLIP, achieves SOTA performance on diverse negation benchmarks covering image-to-text matching, text-to-image retrieval, and image generation.

Paper Structure

This paper contains 33 sections, 6 equations, 3 figures, and 11 tables.

Figures (3)

  • Figure 1: We present TNG-CLIP, a negation-aware CLIP that achieves strong negation understanding on image-to-text matching, text-to-image retrieval, and the proposed Neg-TtoI image generation benchmark.
  • Figure 2: Training procedure of TNG-CLIP. The diagram shows the training-time data generation pipeline for one sample in the batch. For an image-text pair, $P_o$, the most similar image-text pair, $P_s$, is selected by the cosine similarity of their embedded image features. The captions from $P_o$ and $P_s$ are used to find the negation object and generate two types of negation captions. The final image-text set, $S_i$, for the $i^{th}$ image-text pair is composed of one image, $I_i$, one original caption, $T_{o_i}$, one compositional negation caption, $T_{nc_i}$, and one full negation caption, $T_{nf_j}$, from another random sample.
  • Figure 3: The zero-shot image classification accuracy of pre-trained CLIP and TNG-CLIP on eight image classification benchmarks.
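
The similar-pair selection described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `most_similar_indices` and the toy two-dimensional feature vectors are assumptions made for illustration. In practice the features would presumably come from the CLIP image encoder and the comparison would be done with batched matrix operations. The sketch covers only the step of picking, for each sample in a batch, the other sample whose image features have the highest cosine similarity; how the negation object is then extracted from the two captions is not shown.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_indices(features: Sequence[Sequence[float]]) -> List[int]:
    """For each sample i, return the index of the most similar *other* sample.

    Hypothetical helper: a sample is never matched with itself, and a batch
    with a single sample yields -1 because no partner exists.
    """
    partners = []
    for i, f_i in enumerate(features):
        best_j, best_sim = -1, -math.inf
        for j, f_j in enumerate(features):
            if j == i:
                continue  # skip self-similarity, which is always maximal
            sim = cosine_similarity(f_i, f_j)
            if sim > best_sim:
                best_j, best_sim = j, sim
        partners.append(best_j)
    return partners


# Toy image features for a batch of four samples (illustrative values only).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
partners = most_similar_indices(features)
# partners == [1, 0, 3, 2]
```

Excluding the diagonal matters here: without it every sample would select itself, since a vector's cosine similarity with itself is 1.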