CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

Darina Koishigarina, Arnas Uselis, Seong Joon Oh

TL;DR

This work investigates why CLIP exhibits cross-modal bag-of-words behavior and whether attribute-object binding exists within CLIP's unimodal embeddings. It shows that attribute-object binding information is present in both image and text modalities, identifying cross-modal alignment via cosine similarity as the bottleneck. The authors propose LABCLIP, a simple linear transformation $\mathbf{A} \in \mathbb{R}^{D \times D}$ applied to text embeddings and trained with synthetic negatives, to significantly improve cross-modal binding without altering CLIP encoders. Across synthetic datasets (CLEVR, PUG:SPAR, PUG:SPARE) and real-world benchmarks (COCO, ARO, SugarCrepe), LABCLIP reduces the modality gap and enhances compositional reasoning, underscoring that targeted alignment can substantially boost CLIP-like models' understanding of structure.

Abstract

CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.
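The scoring rule described in the abstract (a linear transformation of the text embeddings, followed by cosine similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy dimensions, and the random stand-in embeddings are all assumptions, and the training of $\mathbf{A}$ with synthetic negatives is not shown. Here $\mathbf{A}$ is set to the identity, in which case the transformed score reduces to the standard CLIP cosine similarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def clip_scores(image_emb, text_emb):
    """Standard CLIP-style scoring: cosine similarity for every image-text pair."""
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T

def labclip_scores(image_emb, text_emb, A):
    """Apply a D x D linear map A to the text embeddings, then take cosine similarity.

    Hypothetical helper: the name and signature are illustrative only.
    """
    transformed_text = text_emb @ A.T  # each row t becomes A @ t
    return l2_normalize(image_emb) @ l2_normalize(transformed_text).T

# Random stand-ins for frozen CLIP embeddings (assumption: D = 8 for illustration).
rng = np.random.default_rng(0)
D = 8
image_emb = rng.normal(size=(3, D))  # 3 images
text_emb = rng.normal(size=(5, D))   # 5 captions

A = np.eye(D)  # identity map: no change to the text embeddings
baseline = clip_scores(image_emb, text_emb)
aligned = labclip_scores(image_emb, text_emb, A)
```

Because only `A` would be learned, the CLIP encoders stay frozen, and any change in image-text ranking comes entirely from the transformed text side.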

Paper Structure

This paper contains 24 sections, 4 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: LABCLIP mitigates the BoW behavior of CLIP. (1) It has been reported that CLIP behaves like a BoW model with weak attribute-object binding. (2) We discover that embeddings of individual image and text modalities already contain the attribute-object binding information; this suggests that the cross-modal BoWness stems from the lack of alignment across the modalities. (3) A simple linear transformation of the text modality effectively mitigates the BoWness of CLIP.
  • Figure 2: Comparison of examples from PUG:SPAR [Bordes2024] and PUG:SPARE. In PUG:SPAR, attributes are correlated with object positions: objects on the left are linked to "blue" or "grass", and objects on the right to "red" or "stone". Our dataset, PUG:SPARE, removes this potential shortcut by de-correlating attributes from positions.
  • Figure 3: Uni-modal attribute-object binding. (a) We train a linear probe per object to distinguish its color, separately within the image modality and the text modality. (b) The linear probe establishes decision boundaries in CLIP's representation space that differentiate between various attribute-object associations.
  • Figure 4: Image and text embeddings effectively encode multiple objects. We show the average linear probing accuracy on CLEVR as the number of objects increases. While performance slightly decreases, it remains relatively robust.
  • Figure 5: Alignment reduces the similarity between permuted text pairs. We show the distributions of cosine similarity between original and permuted text, before and after alignment.
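The uni-modal probing setup described in the Figure 3 caption (a linear classifier trained on frozen embeddings to predict one object's color) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the embeddings are synthetic, linearly separable stand-ins rather than real CLIP features, the probe is a plain softmax regression trained by gradient descent, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_probe(X, y, num_classes, lr=1.0, steps=500):
    """Fit a softmax-regression probe on fixed embeddings with full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(W, b, X, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))

# Synthetic stand-in data (assumption): each "color" of the probed object shifts
# the embedding toward a class-specific direction, plus unit Gaussian noise.
rng = np.random.default_rng(0)
num_colors, D, n = 3, 16, 600
y = rng.integers(0, num_colors, size=n)
color_directions = 3.0 * rng.normal(size=(num_colors, D))
X = color_directions[y] + rng.normal(size=(n, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit norm, as CLIP embeddings usually are

# Train on the first 450 samples and evaluate on the held-out 150.
W, b = train_linear_probe(X[:450], y[:450], num_colors)
accuracy = probe_accuracy(W, b, X[450:], y[450:])
```

High held-out accuracy for such a probe indicates that the attribute is linearly decodable from the embedding. In the paper's setting, one probe of this kind would be trained per object, separately for image embeddings and for text embeddings.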