SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models

Sepehr Kazemi Ranjbar, Kumail Alhamoud, Marzyeh Ghassemi

TL;DR

SpaceVLM addresses negation in vision–language models by modeling a negated caption as a subspace of the joint embedding space rather than as a single embedding. Building on the empirical observation that CLIP-like embedding spaces divide into semantically coherent regions, it places spherical caps around the affirmative and negated embeddings using a cosine threshold and takes the central direction of the region that is close to the affirmative concept and far from the negated one. The result is a training-free, model-agnostic negation score. The method yields roughly a 30% average improvement on negation tasks across retrieval, MCQ, and text-to-image generation, while preserving zero-shot performance on affirmative prompts. Evaluations across 40+ settings and multiple backbones show robustness to the choice of threshold and of LLM pre-processor, with practical gains in negation-aware generation, suggesting that this geometric perspective may extend to broader logical reasoning in VLMs.

Abstract

Vision-Language Models (VLMs) struggle with negation. Given a prompt like "retrieve (or generate) a street scene without pedestrians," they often fail to respect the "not." Existing methods address this limitation by fine-tuning on large negation datasets, but such retraining often compromises the model's zero-shot performance on affirmative prompts. We show that the embedding space of VLMs, such as CLIP, can be divided into semantically consistent subspaces. Based on this property, we propose a training-free framework that models negation as a subspace in the joint embedding space rather than a single point (Figure 1). To find the matching image for a caption such as "A but not N," we construct two spherical caps around the embeddings of A and N, and we score images by the central direction of the region that is close to A and far from N. Across retrieval, MCQ, and text-to-image tasks, our method improves negation understanding by about 30% on average over prior methods. It closes the gap between affirmative and negated prompts while preserving the zero-shot performance that fine-tuned models fail to maintain. Code will be released upon publication.
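
The construction described in the abstract can be written down roughly as follows. This is a sketch under stated assumptions, not the paper's own equations: $e_a$, $e_n$, and $\hat{d}$ follow the notation of the Figure 4 caption, while the threshold $\tau$, the cap $C_\tau(\cdot)$, and the region $R$ are notation introduced here. It also assumes that both caps share one cosine threshold and that the "central direction" is the normalized mean of the region.

$$
C_\tau(e) = \{\, x \in \mathbb{S}^{d-1} : \langle x, e \rangle \ge \tau \,\}, \qquad
R = C_\tau(e_a) \setminus C_\tau(e_n), \qquad
\hat{d} = \frac{\int_{R} x \, dx}{\left\lVert \int_{R} x \, dx \right\rVert}
$$

Under these assumptions, an image with unit-norm embedding $v$ would be scored by $\langle v, \hat{d} \rangle$, so images near $e_a$ but away from $e_n$ rank highest.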

Paper Structure

This paper contains 13 sections, 13 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Given a caption such as "Not a photo of a cat", standard VLM approaches attempt to map this negative caption to a single point in the embedding space, which makes it ambiguous where the correct destination should be. In contrast, our approach maps the negative caption to a subspace rather than a point, enhancing the model’s ability to handle negation effectively.
  • Figure 2: In both (a) Image Retrieval and (b) Text-to-Image (T2I) Generation, CLIP embeds the input prompt “a picture of a dog but not on grass” near images that include both dog and grass, leading to incorrect retrievals or generations. In (c) MCQ, CLIP assigns similar similarity scores to all captions mentioning “fish” and “coral,” regardless of whether they include or exclude a concept, leading to incorrect image-text matching. By modeling negation as a subspace, SpaceVLM fixes all these issues. As summarized in (d), this geometric modeling of SpaceVLM empirically improves negation understanding across these tasks, while preserving performance on affirmative prompts.
  • Figure 3: (a) Distribution of cosine similarity scores between images within the same category. (b) Distribution of cosine similarity scores between the textual prompt "A photo of a <category>" and images belonging to that category.
  • Figure 4: A simple 2D illustration of our approach. Each vector represents the center of its corresponding arc. Given a caption such as "A photo of <a> but not <n>", $e_a$ denotes the embedding of "A photo of <a>" and $e_n$ denotes the embedding of "A photo of <n>". We then identify a region that lies close to $e_a$ but distant from $e_n$. The resulting vector $\hat{d}$ serves as the final text embedding, effectively encoding both the affirmative and negated components of the original caption.
  • Figure 5: Algorithm 1: PyTorch-style pseudocode for SpaceVLM, which computes negation-aware text embeddings for a generic VLM.
  • ...and 3 more figures
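
The geometry in Figure 4 can be illustrated numerically. The following is a minimal sketch, not the paper's Algorithm 1: it approximates the central direction by Monte Carlo sampling, which is a technique swapped in here for illustration. The function names (`normalize`, `dot`, `negation_direction`), the parameters (`threshold`, `num_samples`, `noise`), the use of one threshold for both caps, and the toy 3D embeddings are all assumptions.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(a * b for a, b in zip(u, v))


def negation_direction(e_a, e_n, threshold=0.8, num_samples=5000,
                       noise=0.5, seed=0):
    """Approximate the center of the region close to e_a and far from e_n.

    Hypothetical helper: samples unit vectors around e_a, keeps those
    inside the cap of e_a and outside the cap of e_n, and returns the
    normalized mean of the kept samples.
    """
    rng = random.Random(seed)
    e_a, e_n = normalize(e_a), normalize(e_n)
    kept = []
    for _ in range(num_samples):
        # Perturb e_a with Gaussian noise and project back to the sphere.
        cand = normalize([x + rng.gauss(0.0, noise) for x in e_a])
        if dot(cand, e_a) >= threshold and dot(cand, e_n) < threshold:
            kept.append(cand)
    if not kept:
        # Assumed fallback: no sample satisfied both constraints.
        return e_a
    mean = [sum(col) / len(kept) for col in zip(*kept)]
    return normalize(mean)


# Toy embeddings standing in for "A photo of <a>" and "A photo of <n>".
e_a = [1.0, 0.0, 0.0]
e_n = normalize([1.0, 1.0, 0.0])
d_hat = negation_direction(e_a, e_n)

# Toy image embeddings: one showing only <a>, one showing <a> and <n>.
img_a_only = [1.0, 0.0, 0.0]
img_a_and_n = normalize([2.0, 1.0, 0.0])
print(dot(img_a_only, d_hat) > dot(img_a_and_n, d_hat))
```

In this toy setup the image that contains both concepts is the closer one to the plain sum of `e_a` and `e_n`, whereas scoring against `d_hat` is meant to favor the image that contains only the affirmative concept. A real pipeline would take `e_a` and `e_n` from a VLM text encoder and would likely replace the sampling step with whatever closed-form or batched computation the paper's Algorithm 1 uses.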

Theorems & Definitions (1)

  • Proof: Proof sketch