Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

Junhao Xiao; Zhiyu Wu; Hao Lin; Yi Chen; Yahui Liu; Xiaoran Zhao; Zixu Wang; Zejiang He

TL;DR

Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization, indicating stronger robustness across domains.

Abstract

Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.

Paper Structure (14 sections, 20 equations, 4 figures, 3 tables)

Introduction
Related Work
Preliminary: Two-Stage Modeling for Negation Understanding
Methodology
Lens: Syntax–Semantic Dual-Stream Architecture
Frame: Cross-Modal Dynamic Repulsion Weight Generator
Image-Text Matching
Training Strategy
Comparative Experiments
Comparative Analysis
Inherent Ability Retention Analysis
Ablation Analysis
Conclusion
Acknowledgements

Figures (4)

Figure 1: CLIPGlasses enhances CLIP’s capacity for negation understanding by introducing a dynamic repulsion mechanism that suppresses image-text similarity for negated concepts, thus enabling inverse matching while preserving alignment with affirmed content.
Figure 2: t-SNE visualization of CLIP text features for multiple positive-negative sentence pairs (e.g., "there is a woman" vs. "there is not a woman"). Circles and squares denote positive and negative forms; colors distinguish different pairs. Feature clusters across pairs are well separated, showing CLIP’s strong instance-level discrimination. However, the positive and negative features within each pair remain close together, indicating that while CLIP has limited negation modeling capabilities, there is clear potential for semantic disentanglement.
Figure 3: CLIPGlasses enhances CLIP's capability to model negative semantics by introducing two modules: Lens and Frame. Lens disentangles negated concepts (e.g., "dog" in "no dog") from the text embedding $T_{\text{clip}}$. Frame dynamically predicts a repulsion strength $\lambda$ based on cross-modal context. The final similarity score is computed as $S = S_{\text{I2T}} - \lambda \cdot S_{\text{I2T}}^{\text{neg}}$, aligning images with affirmed content while repelling from negated concepts when negation is present in the text.
Figure 4: Distribution of predicted repulsion weight $\lambda$ under varying negation strengths. Stronger negations (e.g., "no") yield higher $\lambda$, confirming the model's ability to adaptively modulate semantic repulsion.
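
The scoring rule in the Figure 3 caption, $S = S_{\text{I2T}} - \lambda \cdot S_{\text{I2T}}^{\text{neg}}$, can be illustrated with a small sketch. This is not the authors' implementation: it assumes both similarity terms are cosine similarities, and it takes the negated-concept embedding and $\lambda$ as given inputs, whereas in the paper they come from the Lens and Frame modules. The function names and the toy 2-D vectors are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def repulsed_similarity(image_emb, text_emb, neg_emb, lam):
    """Compute S = S_I2T - lam * S_I2T_neg.

    neg_emb stands in for the negated-concept embedding (Lens output) and
    lam for the repulsion strength (Frame output); both are assumed inputs
    here, not computed as in the paper.
    """
    s_i2t = cosine_similarity(image_emb, text_emb)
    s_neg = cosine_similarity(image_emb, neg_emb)
    return s_i2t - lam * s_neg


# Toy example: axis 0 ~ "dog", axis 1 ~ "cat" (illustrative values only).
dog_image = [1.0, 0.0]
cat_image = [0.0, 1.0]
text_emb = [0.6, 0.8]   # stand-in embedding for "a cat, no dog"
neg_emb = [1.0, 0.0]    # stand-in for the negated concept "dog"

score_dog = repulsed_similarity(dog_image, text_emb, neg_emb, lam=1.0)
score_cat = repulsed_similarity(cat_image, text_emb, neg_emb, lam=1.0)
```

With $\lambda = 0$ the score reduces to the plain image-text similarity, so the dog image still scores 0.6 against a caption that excludes dogs. With $\lambda = 1$ the repulsion term pushes that score below zero, while the cat image, which has no overlap with the negated concept, is unaffected.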
