
Explaining CLIP Zero-shot Predictions Through Concepts

Onat Ozdemir, Anders Christensen, Stephan Alaniz, Zeynep Akata, Emre Akbas

Abstract

Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce EZPC, a method that bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure while remaining interpretable. Extensive experiments on five benchmark datasets (CIFAR-100, CUB-200-2011, Places365, ImageNet-100, and ImageNet-1k) demonstrate that our approach maintains CLIP's strong zero-shot classification accuracy while providing meaningful concept-level explanations. By grounding open-vocabulary predictions in explicit semantic concepts, our method offers a principled step toward interpretable and trustworthy vision-language models. Code is available at https://github.com/oonat/ezpc.


Paper Structure

This paper contains 46 sections, 21 equations, 22 figures, 12 tables.

Figures (22)

  • Figure 1: Overview of EZPC. CLIP image and text embeddings are projected into a shared concept space using a learnable matrix $A$. The projected representations $c_x$ and $c_k$ provide (i) concept-based explanations via a Hadamard product and (ii) class logits via a dot-product in concept space. Training jointly optimizes a matching loss and a reconstruction loss to preserve CLIP’s predictive behavior.
  • Figure 2: Qualitative comparison of image-level explanations for different $\lambda$ values. For $\lambda=1$, EZPC produces semantically consistent concept activations. For larger values (e.g., $\lambda=100$), unrelated concepts appear among the top activations.
  • Figure 3: Image-level Explanations. For each image, EZPC displays the top-10 activated concepts that contribute most to the zero-shot prediction. The highlighted concepts closely correspond to salient visual characteristics of the input images.
  • Figure 4: Class-level Concept Explanations. For each class, we average concept activations over nine sampled images. EZPC produces coherent class signatures, highlighting concepts that characterize each category.
  • Figure 5: Concept Clustering Results. For each concept, we retrieve the nine images with the highest activation. Clusters from ImageNet-100 and Places365 show coherent semantic structure, indicating that EZPC learns interpretable concept directions.
  • ...and 17 more figures
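
The inference step described in the Figure 1 caption — projecting CLIP image and text embeddings into concept space with a learnable matrix $A$, scoring classes by a dot product in that space, and explaining predictions via a Hadamard product — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: all names and dimensions are assumptions, and real CLIP embeddings are typically 512- or 768-dimensional.

```python
# Minimal sketch of EZPC's inference step (assumed from the Figure 1
# description; not the authors' code). Pure Python for clarity.

def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def ezpc_predict(A, z_x, class_embeddings):
    """Project CLIP embeddings into concept space and score classes.

    A               : learnable projection, n_concepts x d (assumed shape)
    z_x             : CLIP image embedding, length d
    class_embeddings: one CLIP text embedding per class, each length d
    Returns (predicted class index, per-concept explanation for it).
    """
    c_x = matvec(A, z_x)  # concept activations of the image
    logits, explanations = [], []
    for z_k in class_embeddings:
        c_k = matvec(A, z_k)  # concept activations of the class prompt
        # Class logit: dot product of c_x and c_k in concept space.
        logits.append(sum(a * b for a, b in zip(c_x, c_k)))
        # Explanation: Hadamard product, one contribution per concept.
        explanations.append([a * b for a, b in zip(c_x, c_k)])
    pred = max(range(len(logits)), key=logits.__getitem__)
    return pred, explanations[pred]
```

Sorting the returned explanation and keeping the largest entries would yield the kind of top-10 concept lists shown in Figure 3; averaging explanations over images of one class would give the class-level signatures of Figure 4.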