CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR

This work reveals fine-grained biases in CLIP's encoders when handling multi-object scenes: the text encoder tends to overemphasize the first-mentioned object while the image encoder favors larger objects. By introducing the ComCO (and SIMCO) datasets, the authors provide controlled multi-object benchmarks and show that these biases degrade image-text matching in real-world datasets like COCO and even propagate to text-to-image generation via prompt order in Stable Diffusion. They dissect potential origins in the ViT-based image encoder and the cross-modal contrastive training that aligns image-text representations, linking observed biases to training data characteristics in LAION and to the progression of training. The paper also explores preliminary mitigation through per-object caption splitting and embedding aggregation, demonstrating improvements in matching robustness, while acknowledging limitations and outlining directions for bias-mitigation research in vision-language systems. Overall, the study emphasizes the need to address compositional biases to enhance robustness of vision-language models in complex, real-world scenarios and informs future work on training data curation and model architecture adjustments.
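
The mitigation idea mentioned above (per-object caption splitting followed by embedding aggregation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_text_encoder` is a hypothetical stand-in that mimics a first-position bias and is not CLIP, the splitting rule (commas and the word "and") is a simplification, and mean pooling is only one possible aggregation. The paper's actual procedure may differ.

```python
import hashlib
import re

import numpy as np

DIM = 64  # embedding size of the toy encoder (arbitrary choice)


def word_vector(word):
    """Deterministic pseudo-random vector per word (illustrative only)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    return np.random.default_rng(seed).standard_normal(DIM)


def toy_text_encoder(text):
    """Hypothetical stand-in for a text encoder, NOT the real CLIP model.

    Earlier words get larger weights, which mimics an order-sensitive
    encoder that overemphasizes what is mentioned first.
    """
    words = text.lower().split()
    weights = 0.5 ** np.arange(len(words))
    vec = sum(w * word_vector(tok) for w, tok in zip(weights, words))
    return vec / np.linalg.norm(vec)


def split_caption(caption):
    """Split a caption into per-object phrases on commas and 'and' (assumed rule)."""
    parts = re.split(r",|\band\b", caption)
    return [p.strip() for p in parts if p.strip()]


def aggregated_embedding(caption, encode=toy_text_encoder):
    """Encode each per-object phrase separately, then mean-pool and renormalize."""
    parts = split_caption(caption)
    emb = np.mean([encode(p) for p in parts], axis=0)
    return emb / np.linalg.norm(emb)


if __name__ == "__main__":
    a, b = "a dog and a cat", "a cat and a dog"
    # Whole-caption encoding depends on object order in this toy encoder.
    print(np.allclose(toy_text_encoder(a), toy_text_encoder(b)))
    # Mean-pooled per-object embeddings do not depend on object order.
    print(np.allclose(aggregated_embedding(a), aggregated_embedding(b)))
```

Mean pooling is used here because it is invariant to the order of the per-object phrases, so reordering objects in the caption cannot change the aggregated embedding. Whether that invariance is desirable for a given matching task is a separate design question.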

Abstract

Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images. For more details and access to our dataset and analysis code, visit our project repository: https://clip-oscope.github.io.

Paper Structure

This paper contains 57 sections, 1 theorem, 3 equations, 14 figures, and 12 tables.

Key Result

Theorem 1

Let the elements of ${\mathbf z}$ be independent, zero-mean, and unit-variance. The contrastive loss for the ideal text encoder, $i_\omega(T) = {\mathbf z}$, converges to that of a non-ideal incomplete one, i.e., $i_{\omega^\prime}(T) = {\mathbf z}_s$, where ${\mathbf z}_s$ is the first $d-k$ dimensions o

Figures (14)

  • Figure 1: Overview of our key contributions. Step 1: We create the ComCO dataset for controlled multi-object experiments. Step 2: We identify biases in CLIP's image encoder (favoring larger objects) and text encoder (prioritizing first-mentioned objects). Step 3: We investigate the origin of these biases, finding a connection to training data characteristics. Step 4: We demonstrate the practical impact of these biases on the image-text matching task, showing how they affect model performance in multi-object scenarios.
  • Figure 2: Experimental setup for Text-based Object Retrieval (TOR) and Image-based Object Retrieval (IOR) tasks. a) TOR: The CLIP text encoder generates embeddings for multi-object and single-object texts. Cosine similarity scores are calculated between the base text embedding and single-object text embeddings to identify the most similar object. b) IOR: The CLIP image encoder generates embeddings for multi-object and single-object images. Cosine similarity scores are calculated between the base image embedding and single-object image embeddings to identify the most similar object.
  • Figure 3: Attention allocation from the CLS token to objects of different sizes in the ComCO dataset. a) Qualitative results showing the CLS token's attention to each object. b) Quantitative analysis of attention distribution across 8,000 images, with each image containing one large and two small objects. The bar chart shows the average attention allocated to the large object versus the smaller ones, demonstrating a bias towards larger objects.
  • Figure 4: a) Top-1 Object Retrieval accuracy comparison for sentences where the first object is either large or small. The higher TOR accuracy for sentences beginning with large objects supports the hypothesis that larger objects, when mentioned first, exert a stronger influence on text embeddings due to cross-modal alignment with their prominent visual representation in images. b) Distribution of the position of the largest object within image captions from the LAION datasets. The results show a consistent bias where larger objects tend to be mentioned earlier in text descriptions. c) Progression of TOR rates across different training stages, indicating that text-side bias strengthens as the model is exposed to more data, suggesting the cumulative effect of image-side bias being transferred to the text encoder through contrastive learning.
  • Figure 5: An example of the correct and incorrect caption structures in the first and second scenarios.
  • ...and 9 more figures
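
The retrieval step described in the Figure 2 caption (cosine similarity between a multi-object base embedding and each single-object embedding, then picking the most similar object) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `cosine_similarities` and `retrieve_object` are hypothetical, and the embeddings in the usage example are made-up vectors, not outputs of a CLIP encoder.

```python
import numpy as np


def cosine_similarities(base, candidates):
    """Cosine similarity between one base vector and each row of `candidates`."""
    base = np.asarray(base, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    base_unit = base / np.linalg.norm(base)
    cand_unit = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return cand_unit @ base_unit


def retrieve_object(base, candidates, labels):
    """Return the label of the single-object embedding closest to the base embedding."""
    scores = cosine_similarities(base, candidates)
    best = int(np.argmax(scores))
    return labels[best], scores


if __name__ == "__main__":
    # Made-up embeddings: the base vector leans most toward the first object.
    labels = ["dog", "cat", "car"]
    single_object_embs = np.eye(3)
    multi_object_emb = [0.8, 0.5, 0.3]
    label, scores = retrieve_object(multi_object_emb, single_object_embs, labels)
    print(label, scores)
```

The same function applies to both TOR and IOR, since the only difference between the two setups is which encoder produced the embeddings. Tallying how often the retrieved label is the first-mentioned object (for text) or the largest object (for images) would give a retrieval rate of the kind the caption refers to.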

Theorems & Definitions (1)

  • Theorem 1