Exploiting Text-Image Latent Spaces for the Description of Visual Concepts

Laines Schmalwasser, Jakob Gawlikowski, Joachim Denzler, Julia Niebling

TL;DR

This work introduces an approach that aids the interpretation of newly discovered concept sets by suggesting textual descriptions for each Concept Activation Vector (CAV), encoding the most relevant receptive fields instead of full images.

Abstract

Concept Activation Vectors (CAVs) offer insights into neural network decision-making by linking human-friendly concepts to the model's internal feature extraction process. However, when a new set of CAVs is discovered, they must still be translated into a human-understandable description. For image-based neural networks, this is typically done by visualizing the most relevant images of a CAV, while the determination of the concept is left to humans. In this work, we introduce an approach to aid the interpretation of newly discovered concept sets by suggesting textual descriptions for each CAV. This is done by mapping the most relevant images representing a CAV into a text-image embedding space, where a joint description of these relevant images can be computed. We propose encoding the most relevant receptive fields instead of full images. We demonstrate the capabilities of this approach in multiple experiments with and without given CAV labels, showing that the proposed approach provides accurate descriptions for the CAVs and reduces the challenge of concept interpretation.
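
To make the idea of a "joint description" in a text-image embedding space more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that embeddings of the visual representations and of the candidate texts have already been computed by some CLIP-like encoder, and it aggregates by mean cosine similarity, which is an illustrative assumption. The function name `top_k_descriptions` and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_descriptions(image_emb, text_emb, texts, k=3):
    """Rank candidate texts against a set of image embeddings.

    image_emb: (n_images, d) embeddings of the visual representations of one CAV.
    text_emb:  (n_texts, d) embeddings of the candidate descriptions.
    NOTE: averaging cosine similarities over the images is an assumption
    made for illustration, not a scoring rule taken from the paper.
    """
    # Normalize rows so that dot products become cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = img @ txt.T               # (n_images, n_texts)
    joint = sim.mean(axis=0)        # one joint score per candidate text
    order = np.argsort(-joint)[:k]  # indices of the k highest-scoring texts
    return [(texts[i], float(joint[i])) for i in order]

# Toy 2-D embeddings standing in for real encoder outputs.
image_emb = np.array([[1.0, 0.0], [1.0, 0.1]])
text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
texts = ["stripes", "fur", "sky"]
ranked = top_k_descriptions(image_emb, text_emb, texts, k=2)
```

In this toy example, "stripes" ranks first because its embedding points in almost the same direction as both image embeddings.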

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Examples of two CAVs computed from the first residual block of a ResNet50, trained on Animals with Attributes 2 xian2018zero. The first row of each subfigure shows the full representative images of the CAVs and the textual descriptions generated from the full images. The second row shows the representative receptive fields for the same CAVs and the textual descriptions derived from those receptive fields.
  • Figure 1: Comparison of the approaches to generate textual descriptions. Shown are influential CAVs for the class "dog" after the first residual block of a ConvMixer trockman2023patches. The probing dataset is the validation set from ImageNet imagenet and the concept set is google20k First20hoursGoogle10000englishThis.
  • Figure 2: Overview of our approach to describe the layer $l$ of a pretrained model $f$. The inputs are a concept discovery method, a probing set $D_{probe}$, and a set of textual descriptions $T$. We apply concept discovery methods to find a set of CAVs, generate a set of visual concept descriptions $Q_j$ for each CAV $c_j$, then generate textual concept descriptions, and finally output the top-$k$ descriptions $T_k \subseteq T$.
  • Figure 2: Description of the classes "hamster", "zebra", "raccoon" and "rabbit" according to a set of CAVs. For each class, the textual descriptions and the most activated receptive fields of the CAVs with the strongest influence are shown. The image set was selected by $F_{mean\rightarrow max}$. The set of CAVs describes the hidden representation after the first residual block of a ResNet50 krizhevsky2009learning finetuned on AwA2 xian2018zero. The probing dataset is the validation set from ImageNet imagenet and the concept set is google20k First20hoursGoogle10000englishThis.
  • Figure 3: Selection of the visual representations for a given CAV $c_j$; compare with the column "Visual Concept Description" in the overview figure of the approach. The vector $(v^1_j(\hat{x}^1_i), \dots, v^F_j(\hat{x}^F_i))$ contains the concept scores between each receptive field of $x_i$ and the CAV $c_j$. While oikarinenCLIPDissectAutomaticDescription2023 select full images based on the mean score of all receptive fields, we also consider the receptive field with the highest concept score. Thus, we improve the visual input of the joint vision-text embedding by cropping $x_i$ to the respective receptive field. This creates a more truthful and more detailed representation of the concepts learned in the hidden space.
  • ...and 2 more figures
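
The selection step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it reads $F_{mean\rightarrow max}$ as "rank images by the mean concept score over all receptive fields, then keep the receptive field with the highest score in each selected image", which is an interpretation of the captions. The function name `select_mean_then_max`, the flat row-major layout of receptive fields, and the toy scores are hypothetical. Mapping a grid position back to pixel coordinates depends on the architecture and is left out.

```python
import numpy as np

def select_mean_then_max(scores, grid_shape, n_images=2):
    """Select images by mean concept score, then their best receptive field.

    scores:     (N, F) array, scores[i, f] = concept score of receptive
                field f of image i for one CAV (F = grid_h * grid_w).
    grid_shape: (grid_h, grid_w) layout of the receptive fields; a
                row-major flattening is assumed here.
    Returns a list of (image_index, grid_row, grid_col) tuples.
    """
    mean_scores = scores.mean(axis=1)                 # one score per image
    top_images = np.argsort(-mean_scores)[:n_images]  # best images first
    best_field = scores[top_images].argmax(axis=1)    # strongest field per image
    rows, cols = np.unravel_index(best_field, grid_shape)
    return [(int(i), int(r), int(c)) for i, r, c in zip(top_images, rows, cols)]

# Toy scores: 3 images, each with a 2x2 grid of receptive fields.
scores = np.array([
    [0.1, 0.2, 0.1, 0.0],   # image 0, mean 0.1
    [0.9, 0.1, 0.2, 0.0],   # image 1, mean 0.3
    [0.2, 0.3, 0.8, 0.3],   # image 2, mean 0.4
])
selection = select_mean_then_max(scores, grid_shape=(2, 2), n_images=2)
```

Here image 2 is selected first and would be cropped to the receptive field at grid position (1, 0), followed by image 1 at position (0, 0). The crops, rather than the full images, would then be passed to the vision encoder.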