Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

Junhao Xiao, Zhiyu Wu, Hao Lin, Yi Chen, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He

TL;DR

Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization, indicating stronger robustness across domains.

Abstract

Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.

Paper Structure (14 sections, 20 equations, 4 figures, 3 tables)

This paper contains 14 sections, 20 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: CLIPGlasses enhances CLIP’s capacity for negation understanding by introducing a dynamic repulsion mechanism that suppresses image-text similarity for negated concepts, thus enabling inverse matching while preserving alignment with affirmed content.
  • Figure 2: t-SNE visualization of CLIP text features for multiple positive-negative sentence pairs (e.g., "there is a woman" vs. "there is not a woman"). Circles and squares denote positive and negative forms; colors distinguish different pairs. Feature clusters across pairs are well separated, showing CLIP’s strong instance-level discrimination. However, the positive and negative features within each pair remain closely positioned, indicating that CLIP has limited negation modeling capability but that there is clear potential for semantic disentanglement.
  • Figure 3: CLIPGlasses enhances CLIP's capability to model negative semantics by introducing two modules: Lens and Frame. Lens disentangles negated concepts (e.g., "dog" in "no dog") from the text embedding $T_{\text{clip}}$. Frame dynamically predicts a repulsion strength $\lambda$ based on cross-modal context. The final similarity score is computed as $S = S_{\text{I2T}} - \lambda \cdot S_{\text{I2T}}^{\text{neg}}$, which aligns images with affirmed content and repels them from negated concepts when negation is present in the text.
  • Figure 4: Distribution of predicted repulsion weight $\lambda$ under varying negation strengths. Stronger negations (e.g., "no") yield higher $\lambda$, confirming the model's ability to adaptively modulate semantic repulsion.
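
The scoring rule in the Figure 3 caption, $S = S_{\text{I2T}} - \lambda \cdot S_{\text{I2T}}^{\text{neg}}$, can be illustrated with a small sketch. This is not the authors' implementation. It assumes both similarities are cosine similarities, replaces the learned Lens output with a hand-picked vector, and passes $\lambda$ in as a fixed number rather than predicting it with Frame. The function names and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def repulsion_similarity(image_emb, text_emb, negated_emb, lam):
    """Compute S = S_I2T - lam * S_I2T^neg.

    image_emb   : image embedding
    text_emb    : text embedding of the full caption (T_clip in the caption)
    negated_emb : embedding of the negated concept (stand-in for the Lens output)
    lam         : repulsion strength (a fixed stand-in for the Frame prediction)
    """
    s_i2t = cosine(image_emb, text_emb)
    s_neg = cosine(image_emb, negated_emb)
    return s_i2t - lam * s_neg


# Toy 2-D embeddings, chosen by hand for illustration only.
dog_image = [1.0, 0.0]
no_dog_text = [0.9, 0.1]      # a "no dog" caption that still sits close to "dog"
negated_concept = [1.0, 0.0]  # assumed Lens output for the concept "dog"

# lam = 0 reduces to the plain image-text similarity.
plain = repulsion_similarity(dog_image, no_dog_text, negated_concept, lam=0.0)
# lam > 0 subtracts the similarity to the negated concept.
repelled = repulsion_similarity(dog_image, no_dog_text, negated_concept, lam=1.0)

print(f"plain similarity:    {plain:.4f}")
print(f"repelled similarity: {repelled:.4f}")
```

With $\lambda = 0$ the score is the ordinary image-text similarity, so a caption without negation would be scored as before. With $\lambda > 0$ the dog image is pushed away from the "no dog" caption in proportion to how strongly it matches the negated concept.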