Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning

Ashim Dahal, Saydul Akbar Murad, Nick Rahimi

TL;DR

This paper investigates how CLIP embeddings shift under nine image augmentations to illuminate mechanistic interpretability in vision-language models. The authors implement a systematic framework using 13,312 Conceptual Captions images (with 2,000 for extra metrics) and metrics such as $L_2$ distance, cosine similarity, attention-shift, patch-similarity, edge- and detail-preservation, plus dendrogram clustering and KDE analyses. They find that noise, perspective transforms, and shift-scale rotations induce the largest embedding shifts, while brightness/contrast, horizontal flips, and some elastic/perspective augmentations show relative invariance; a strong correlation emerges between edge/detail preservation and embedding similarity. These results provide a quantitative foundation for robustness and mechanistic interpretability in VLMs, and point to directions for cross-model analyses, layer-wise studies, and adversarial-data defenses. The work advances understanding of how augmentations reshape CLIP's internal representations and offers a concrete methodology for evaluating robustness in multimodal systems, with code available at the provided repository.

Abstract

Understanding how the representations of Vision Language Models (VLMs) like CLIP shift under different augmentations provides valuable insight into mechanistic interpretability. In this study, we measure the shift in CLIP's embeddings under 9 common augmentation techniques: noise, blur, color jitter, scale and rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We examine the embedding shifts using attention-map similarity, patch similarity, edge preservation, detail preservation, cosine similarity, L2 distance, pairwise distance, and dendrogram clusters, and provide qualitative analysis on sample images. Our findings suggest that certain augmentations, such as noise, perspective transform, and shift scaling, have a more drastic impact on the embedding shift. This study provides a concrete foundation for future work on VLM robustness for mechanistic interpretability and adversarial data defense. The code for this study is available at https://github.com/ashimdahal/clip-shift-analysis.
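
The two simplest metrics named in the abstract, cosine similarity and L2 distance, can be illustrated with a minimal sketch. It assumes the embeddings of an original image and its augmented copy are already available as equal-length lists of floats; the toy vectors below are made up and are not CLIP outputs. The function names (`l2_distance`, `cosine_similarity`, `embedding_shift`) are illustrative and are not taken from the authors' repository.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def embedding_shift(original, augmented):
    """Summarize how far an augmented embedding moved from the original.

    A larger L2 distance and a lower cosine similarity both indicate
    a larger shift.
    """
    return {
        "l2": l2_distance(original, augmented),
        "cosine": cosine_similarity(original, augmented),
    }


if __name__ == "__main__":
    # Hypothetical 4-dimensional stand-ins for real image embeddings.
    original = [0.2, 0.5, 0.1, 0.7]
    augmented = [0.25, 0.45, 0.3, 0.6]
    print(embedding_shift(original, augmented))
```

Averaging these two values over every image, separately for each augmentation, would give one plausible per-augmentation shift score; whether the study aggregates in exactly this way is not stated in the abstract.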

Paper Structure

This paper contains 15 sections, 11 equations, 11 figures.

Figures (11)

  • Figure 1: Qualitative analysis of the final-layer attention map of CLIP for vision augmentation techniques
  • Figure 2: Research methodology and list of qualitative analyses performed
  • Figure 3: Sorted heatmap of augmentation performance
  • Figure 4: Overall Analysis of Proposed Methodology.
  • Figure 5: Contextualized radar plot of additional metrics
  • ...and 6 more figures