Concept Visualization: Explaining the CLIP Multi-modal Embedding Using WordNet

Loris Giulivi; Giacomo Boracchi

TL;DR

This work tackles the opacity of CLIP-based vision backbones by introducing Concept Visualization (ConVis), a WordNet-guided, task-agnostic saliency method that explains CLIP embeddings through associations between image regions and concepts. It defines $z(\tilde{\mathbf{x}}, \mathbf{s})$, $rank\_sim(\tilde{\mathbf{x}}, \mathbf{s})$, and $max\_rank\_sim(\tilde{\mathbf{x}}, \mathbf{s})$ to match image patches to WordNet synsets through the CLIP embeddings of the synsets' definitions, producing pixel-level saliency maps for any synset in the WordNet hierarchy. The authors validate ConVis with out-of-distribution (OOD) detection, weakly supervised object localization (WSOL), and a user study, showing that WordNet definitions align well with CLIP representations and that the explanations provide meaningful insight into CLIP’s semantic understanding. This approach broadens explainability for multimodal backbones, enabling concept-level interpretations beyond the model’s training classes and potentially improving trust in domains that demand transparency.

Abstract

Advances in multi-modal embeddings, and in particular CLIP, have recently driven several breakthroughs in Computer Vision (CV). CLIP has shown impressive performance on a variety of tasks, yet its inherently opaque architecture may hinder the application of models employing CLIP as a backbone, especially in fields where trust and model explainability are imperative, such as the medical domain. Current explanation methodologies for CV models rely on Saliency Maps computed through gradient analysis or input perturbation. However, these Saliency Maps can only be computed to explain classes relevant to the end task, which are often smaller in scope than the backbone training classes. In the context of models implementing CLIP as their vision backbone, a substantial portion of the information embedded within the learned representations is thus left unexplained. In this work, we propose Concept Visualization (ConVis), a novel saliency methodology that explains the CLIP embedding of an image by exploiting the multi-modal nature of the embeddings. ConVis makes use of lexical information from WordNet to compute task-agnostic Saliency Maps for any concept, not limited to concepts the end model was trained on. We validate our use of WordNet via an out-of-distribution detection experiment, and test ConVis on an object localization benchmark, showing that Concept Visualizations correctly identify and localize the image's semantic content. Additionally, we perform a user study demonstrating that our methodology can give users insight into the model's functioning.

Paper Structure (6 sections, 11 equations, 6 figures, 2 tables)

Introduction
Related Work
Background
Concept Visualization
Experiments and Results
Conclusions and Future Works

Figures (6)

Figure 1: Example of Concept Visualizations, computed for a variety of WordNet synsets. We mask the image based on the saliency value. ConVis highlights the regions of the image that relate to any WordNet synset.
Figure 2: Concept Visualization computation diagram. The image patches at two different scales are encoded through CLIP's image embedding network $\mathcal{E}^I$, and concept definitions for synsets in $\mathbb{S}$ are encoded through CLIP's text embedding network $\mathcal{E}^T$. We then compute the $max\_rank\_sim$ and obtain scores $\mathbf{r}_{i,j}$ for each region in the image. The Saliency Map pixel $\mathbf{y}[i,j]$ is computed by averaging all scores $\mathbf{r}_{l,m}$ computed from patches that contained pixel $\mathbf{x}[i,j]$. The hatched rectangle on the right displays all the locations that satisfy this property.
Figure 3: WSOL evaluation. We evaluate localization accuracy by computing the number of samples for which the $IoU$ between ground truth and Saliency Map is above $0.5$. To compute $IoU$, the Saliency Maps are thresholded at value $\tau$, which is optimized over the dataset.
Figure 4: Example question from the survey. ① The user can explore Saliency Maps for concepts hierarchically, ② can click on the Saliency Map of a concept to explore its sub-concepts, ③ or can search for specific concepts. ④ The user has to answer four questions, each with four candidate answers in the form of COCO captions, of which one is correct.
Figure 5: Results of the user study. We report the number of correct answers, out of four questions, for all 18 participants.
...and 1 more figure
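
The averaging step in the Figure 2 caption and the thresholded $IoU$ check in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It assumes the per-patch scores $\mathbf{r}_{l,m}$ have already been computed; the function names, the `(top, left, height, width, score)` patch layout, and the comparison of binary masks (rather than whatever localization format the benchmark uses) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def aggregate_patch_scores(image_shape, patches):
    """Average per-patch scores into a pixel-level saliency map.

    image_shape: (height, width) of the image.
    patches: iterable of (top, left, height, width, score) tuples
             (hypothetical layout; patches may overlap and differ in scale).
    Each output pixel is the mean score of all patches that contain it.
    """
    h, w = image_shape
    total = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.int64)
    for top, left, ph, pw, score in patches:
        total[top:top + ph, left:left + pw] += score
        count[top:top + ph, left:left + pw] += 1
    # Pixels covered by no patch are left at 0 (an arbitrary choice here).
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)


def mask_iou(saliency, gt_mask, tau):
    """IoU between the saliency map thresholded at tau and a binary mask."""
    pred = saliency >= tau
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)


def is_localized(saliency, gt_mask, tau, iou_threshold=0.5):
    """A sample counts as correctly localized when IoU exceeds the threshold."""
    return mask_iou(saliency, gt_mask, tau) > iou_threshold


if __name__ == "__main__":
    # One large patch covering a 4x4 image and one small patch in its corner.
    patches = [(0, 0, 4, 4, 0.2), (0, 0, 2, 2, 1.0)]
    y = aggregate_patch_scores((4, 4), patches)
    gt = np.zeros((4, 4), dtype=bool)
    gt[:2, :2] = True
    print(y)
    print(is_localized(y, gt, tau=0.5))
```

In this toy example the top-left pixels are covered by both patches, so their value is the mean of the two scores, while the remaining pixels keep the score of the large patch alone. In the paper $\tau$ is optimized over the dataset; here it is simply passed in as an argument.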
