From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance

Maximilian Dreyer; Lorenz Hufe; Jim Berend; Thomas Wiegand; Sebastian Lapuschkin; Wojciech Samek

TL;DR

The paper studies CLIP's internal mechanisms by combining sparse, interpretable latent components with instance-specific attributions. It uses Sparse Autoencoders (SAEs) to extract semantically meaningful latent directions, labels them via semantic alignment, and adapts attribution patching to quantify each component's influence on individual predictions, while highlighting faithfulness limitations of the widely used Logit Lens. The authors show that hundreds of latent components across CLIP variants encode surprising or spurious concepts (polysemous words, compound nouns, visual typography, dataset artifacts) and that such components can unduly drive predictions, including in a skin lesion detection case study where linear classifiers trained on image embeddings amplify hidden shortcuts. The work provides a scalable, mechanistic interpretability framework and finds that text embeddings, though prone to semantic ambiguity, are more robust to spurious correlations than such linear classifiers. Code is publicly available.

Abstract

Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at https://github.com/maxdreyer/attributing-clip.
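To make the distinction between the two attribution styles named in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical three-component latent layer, a `tanh` standing in for all later model layers, and central finite differences in place of backpropagation. Under these assumptions, attribution patching scores a component as activation times the gradient of the output with respect to that activation, whereas a Logit Lens style score projects the component's decoder direction straight onto the text embedding and so ignores downstream processing. All names and values are illustrative.

```python
import math

def score(acts, decoder, text_emb):
    """Toy image-text score: decode latents, apply tanh (a stand-in for
    later layers, an assumption of this sketch), dot with the text embedding."""
    dim = len(text_emb)
    hidden = [sum(a * row[i] for a, row in zip(acts, decoder)) for i in range(dim)]
    return sum(math.tanh(h) * t for h, t in zip(hidden, text_emb))

def gradient(acts, decoder, text_emb, eps=1e-5):
    """Central finite-difference gradient of the score w.r.t. each latent activation."""
    grads = []
    for j in range(len(acts)):
        up, down = list(acts), list(acts)
        up[j] += eps
        down[j] -= eps
        grads.append((score(up, decoder, text_emb) - score(down, decoder, text_emb)) / (2 * eps))
    return grads

def attribution_patching(acts, decoder, text_emb):
    """First-order estimate of the effect of zeroing each component: a_j * dscore/da_j."""
    return [a * g for a, g in zip(acts, gradient(acts, decoder, text_emb))]

def logit_lens(acts, decoder, text_emb):
    """Direct projection of each decoder direction onto the text embedding."""
    return [a * sum(r * t for r, t in zip(row, text_emb)) for a, row in zip(acts, decoder)]

def ablation_effect(acts, decoder, text_emb):
    """Exact score drop when one component is set to zero (reference value)."""
    base = score(acts, decoder, text_emb)
    effects = []
    for j in range(len(acts)):
        ablated = list(acts)
        ablated[j] = 0.0
        effects.append(base - score(ablated, decoder, text_emb))
    return effects

# Hypothetical values: 3 latent components, 2-dimensional embedding space.
acts = [0.1, 0.0, 0.05]
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_emb = [1.0, -1.0]

attr = attribution_patching(acts, decoder, text_emb)
lens = logit_lens(acts, decoder, text_emb)
ablate = ablation_effect(acts, decoder, text_emb)
```

In this toy setting the third component's decoder direction is orthogonal to the text embedding, so the direct projection assigns it no relevance, while the gradient-based score picks up a small negative contribution that arises only through the nonlinearity. This is a simplified illustration of why a direct readout can diverge from a component's actual effect on the output.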
