From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance

Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek

TL;DR

The paper studies CLIP's internal mechanisms by combining sparse, interpretable latent components with instance-specific attributions. It uses Sparse Autoencoders (SAEs) to extract semantically meaningful latent directions, labels them via semantic alignment, and computes instance-wise attributions to quantify each component's influence on predictions, addressing faithfulness limitations of the Logit Lens. The authors show that hundreds of latent components encode surprising or spurious concepts across CLIP variants and that certain components can unduly drive predictions, including in a melanoma-detection case where background cues bias results. The work provides a scalable, mechanistic interpretability framework, reveals robust and brittle aspects of text-image probing, and offers practical remedies to improve robustness. Code is publicly available.

Abstract

Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at https://github.com/maxdreyer/attributing-clip.
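The contrast between Logit Lens and instance-wise attribution described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes an embedding reconstructed as a decoder bias plus an activation-weighted sum of decoder directions, a cosine-similarity output, and a zero baseline for the patched activations. All names (`decode`, `logit_lens`, `attributions`) and all numbers are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def decode(acts, decoder, bias):
    """Assumed SAE decoder: bias plus activation-weighted sum of directions."""
    return [b + sum(a * d[i] for a, d in zip(acts, decoder))
            for i, b in enumerate(bias)]

def logit_lens(decoder, text):
    """Global score: project each decoder direction onto the text embedding.
    Ignores whether the component is active on a given input."""
    return [dot(d, text) for d in decoder]

def grad_cosine(x, text):
    """Analytic gradient of cosine(x, text) with respect to x."""
    nx, nt = norm(x), norm(text)
    c = dot(x, text) / (nx * nt)
    return [t / (nx * nt) - c * xi / (nx * nx) for xi, t in zip(x, text)]

def attributions(acts, decoder, bias, text):
    """Instance-wise score: activation times gradient of the output with
    respect to that activation (first-order estimate of zeroing it)."""
    g = grad_cosine(decode(acts, decoder, bias), text)
    return [a * dot(d, g) for a, d in zip(acts, decoder)]

# Toy setup with two components in a 2-d embedding space.
decoder = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, 0.5]
text = [1.0, 0.0]
acts = [0.0, 2.0]  # component 0 is inactive on this input

lens_scores = logit_lens(decoder, text)
attr_scores = attributions(acts, decoder, bias, text)
print("Logit Lens:  ", lens_scores)
print("Attributions:", attr_scores)
```

In this toy case Logit Lens ranks component 0 highest although it is inactive and therefore receives zero attribution, while component 1 has a Logit Lens score of zero but a negative attribution, because the normalization in the cosine similarity makes it rotate the embedding away from the text direction.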

Paper Structure

This paper contains 32 sections, 29 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: A framework for interpreting CLIP via latent attributions. a) Sparse autoencoders (SAEs) extract a diverse set of interpretable latent components from CLIP representations. Each component is assigned a textual label from an expected set of concepts. b) Instance-wise attribution scores quantify each component’s contribution to model predictions. c) Attribution reveals unexpected model behavior, including components that are unusually predictive relative to a baseline expectation (e.g., on a test set) or lack strong semantic alignment to expected textual labels.
  • Figure 2: Methodological overview: a) For each component, we collect its most activating samples and assign a textual label based on semantic alignment. b) Logit Lens computes a global alignment score by projecting each latent embedding onto a textual embedding. c) Instance-wise attributions are derived by performing a forward pass to obtain predictions and backpropagating gradients to estimate each component’s contribution. In contrast to Logit Lens, this approach, applied after the last transformer block, accounts for component magnitude (including activation) and adds a local correction term.
  • Figure 3: Faithfulness evaluation of latent attributions on the ImageNet test set. We measure output scores while performing latent deletion (setting activations to zero) with instance-wise attributions (left) and average attributions on a subset (middle), and while performing latent insertion (right).
  • Figure 4: Analysis of components extracted by SAEs in CLIP. a) Most latent components have both low activation and low interpretability; activation strongly correlates with interpretability. b) Compared to the inherent neural basis and PCA, SAEs yield a significantly more diverse and specific set of concepts. Larger CLIP models tend to encode a broader range of semantic concepts. c) When probing components with expected labels, a large fraction show weak alignment (compared to alignment between ImageNet-1k class names and test set images). Unexpected concepts include game renderings, cliparts, and visually similar object pairs. More examples can be found in the appendix section on SAE interpretability.
  • Figure 5: Analysis of failure modes in text-image probing. We evaluate CLIP’s robustness to textual ambiguities and dataset artifacts across multiple model variants. Failure cases include polysemous words, compound nouns, typography in images, and spurious correlations. For each scenario, we assess the separability of true images from found distractors ("spurious AUC") and from unrelated ImageNet classes ("valid AUC"). We compare performance using both vanilla and enriched text prompts (e.g., via templating and detailed descriptions), and include linear classifiers trained on image embeddings as a baseline. Uncertainty estimates are provided in the appendix section on robustness.
  • ...and 10 more figures
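
The "spurious AUC" and "valid AUC" quantities named in the Figure 5 caption can be sketched as follows. The caption does not say how the AUC is computed, so this sketch assumes the standard ROC AUC in its pairwise form (the probability that a true image scores higher than a negative image, with ties counted as one half). The function name `auc` and all similarity scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.
    Ties contribute 0.5, matching the usual ROC AUC definition."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical text-image similarity scores for a single class prompt.
true_images = [0.31, 0.29, 0.27]
distractors = [0.30, 0.22]      # e.g. another sense of a polysemous word
unrelated = [0.12, 0.15, 0.10]  # images from unrelated classes

spurious_auc = auc(true_images, distractors)
valid_auc = auc(true_images, unrelated)
print(f"spurious AUC: {spurious_auc:.3f}, valid AUC: {valid_auc:.3f}")
```

With these made-up scores the prompt separates true images from unrelated classes perfectly, but one distractor outscores two of the three true images, so the spurious AUC is noticeably lower than the valid AUC.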