Explaining Explainability: Recommendations for Effective Use of Concept Activation Vectors

Angus Nicolson, Lisa Schut, J. Alison Noble, Yarin Gal

TL;DR

This paper investigates three properties that can distort CAV-based explanations: layer inconsistency, entanglement, and spatial dependence. It formalises hypotheses, develops tools to detect each property, and demonstrates their impact via experiments on Elements, ImageNet, and a melanoma classification use-case (ISIC 2019). The Elements synthetic dataset enables controlled study of ground-truth concept–class relationships and concept entanglement, while spatially dependent CAVs enable testing whether a model is translation invariant with respect to a given concept and class. The authors provide concrete practitioner recommendations (use multiple layers, verify concept dependencies, and visualise spatial dependence) and release Elements to facilitate further research in interpretability methods. Overall, the work clarifies when CAVs produce reliable explanations and how to diagnose and leverage their properties for deeper model understanding.

Abstract

Concept-based explanations translate the internal representations of deep learning models into a language that humans are familiar with: concepts. One popular method for finding concepts is Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars. In this work, we investigate three properties of CAVs: (1) inconsistency across layers, (2) entanglement with other concepts, and (3) spatial dependency. Each property provides both challenges and opportunities in interpreting models. We introduce tools designed to detect the presence of these properties, provide insight into how each property can lead to misleading explanations, and provide recommendations to mitigate their impact. To demonstrate practical applications, we apply our recommendations to a melanoma classification task, showing how entanglement can lead to uninterpretable results and that the choice of negative probe set can have a substantial impact on the meaning of a CAV. Further, we show that understanding these properties can be used to our advantage. For example, we introduce spatially dependent CAVs to test if a model is translation invariant with respect to a specific concept and class. Our experiments are performed on natural images (ImageNet), skin lesions (ISIC 2019), and a new synthetic dataset, Elements. Elements is designed to capture a known ground truth relationship between concepts and classes. We release this dataset to facilitate further research in understanding and evaluating interpretability methods.
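
The abstract notes that CAVs are learnt from a probe dataset of concept exemplars. As a rough illustration only, the sketch below fits a linear classifier that separates activations of a positive probe set from those of a negative probe set and takes the normalised weight vector as the CAV. The function name `train_cav`, the plain gradient-descent logistic regression, and the synthetic activations are all assumptions made for this example; they are not the authors' implementation.

```python
import numpy as np


def train_cav(pos_acts, neg_acts, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe and return its unit-norm weight vector.

    pos_acts / neg_acts: arrays of shape (n_examples, n_features) holding
    flattened layer activations for the positive and negative probe sets.
    This is a hypothetical helper, not part of any library.
    """
    X = np.concatenate([pos_acts, neg_acts], axis=0)
    y = np.concatenate([np.ones(len(pos_acts)), np.zeros(len(neg_acts))])
    Xc = X - X.mean(axis=0)  # centre activations for a stabler fit
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(Xc @ w + b, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (Xc.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)  # the CAV: direction pointing towards the concept


# Synthetic stand-in for real activations (assumption): the concept shifts
# activations along one known direction, so the CAV should recover it.
rng = np.random.default_rng(0)
n_features = 16
true_dir = np.zeros(n_features)
true_dir[0] = 1.0
neg = rng.normal(size=(200, n_features))
pos = rng.normal(size=(200, n_features)) + 3.0 * true_dir

cav = train_cav(pos, neg)
print("cosine similarity to the planted direction:", float(cav @ true_dir))
```

Swapping `neg` for a different negative probe set changes the direction the probe finds, which is one way to see why the abstract stresses that the choice of negative set affects the meaning of a CAV.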

Paper Structure (79 sections, 35 equations, 40 figures, 3 tables)

This paper contains 79 sections, 35 equations, 40 figures, 3 tables.

Figures (40)

  • Figure 1: Concept Activation Vectors can be: inconsistent across layers, i.e., we cannot find two concept vectors in different layers that have the same additive effect (left), entangled (middle) and spatially dependent (right). The top panel illustrates each of these different properties. The bottom panels show our recommendations on how to mitigate the impact these effects can have: creating CAVs for multiple layers (left), verifying expected dependencies between related concepts (middle), and visualising spatial dependence (right).
  • Figure 2: Example images from Elements probe datasets. (a) Negative probe set. A random selection of images -- equivalent to images found in the model training set. (b) Positive probe set for stripes. (c) Positive probe set for stripes on the left. (d) Positive probe set for stripes on the right.
  • Figure 3: Empirical evidence for inconsistent CAVs across layers. The consistency error for different ${\bm{v}}_{c,l_2}$ for striped in the penultimate convolutional layer of a ResNet-50 trained on ImageNet. The optimised CAV acts as a lower bound, whereas the random CAV and Direction act as baselines that provide intuitive upper bounds. Concept CAV: striped CAVs, trained as normal. Projected CAV: striped CAVs from layer $l_1$ projected into layer $l_2$, $f({\bm{v}}_{c,l_1})$.
  • Figure 4: Cosine similarities demonstrating entangled concepts. Mean pairwise cosine similarities for all concepts from different versions of the simple Elements dataset, with an increasing association between red and triangle from left to right: $\mathbb{E}_1$, $\mathbb{E}_2$ and $\mathbb{E}_3$.
  • Figure 5: Consistency, entanglement, and spatial dependence can affect TCAV scores. The standard deviation is shown in black for significant results and in red for insignificant results. The null for each layer is shown as a horizontal black line.
  • ...and 35 more figures
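
Figure 4 reports mean pairwise cosine similarities between concept CAVs as a signal of entanglement. A minimal sketch of that kind of computation follows, assuming each concept has several CAVs (for example, from different probe-set samples) stacked as rows of an array. The helper name `mean_pairwise_cosine`, the concept names, and the synthetic vectors are illustrative assumptions, not the paper's code or data.

```python
import numpy as np


def mean_pairwise_cosine(cavs_a, cavs_b):
    """Mean cosine similarity over all pairs (one CAV from each set).

    cavs_a, cavs_b: arrays of shape (n_cavs, n_features).
    Hypothetical helper written for this example.
    """
    a = cavs_a / np.linalg.norm(cavs_a, axis=1, keepdims=True)
    b = cavs_b / np.linalg.norm(cavs_b, axis=1, keepdims=True)
    return float((a @ b.T).mean())


# Synthetic CAVs (assumption): "red" and "triangle" share part of their
# direction, mimicking an association between them; "stripes" does not.
rng = np.random.default_rng(0)
n_features, n_cavs = 8, 5
e = np.eye(n_features)


def noisy(direction):
    return direction + 0.05 * rng.normal(size=(n_cavs, n_features))


cavs = {
    "red": noisy(e[0]),
    "triangle": noisy(0.8 * e[0] + 0.6 * e[1]),
    "stripes": noisy(e[2]),
}

names = list(cavs)
# Note: diagonal entries compare a concept's CAVs with each other,
# including each CAV with itself.
sim = np.array([[mean_pairwise_cosine(cavs[p], cavs[q]) for q in names] for p in names])
for name, row in zip(names, sim):
    print(name, np.round(row, 2))
```

A large off-diagonal entry, such as the one between the hypothetical `red` and `triangle` CAVs here, is the kind of pattern that would prompt a check of whether the two concepts are associated in the training data.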

Theorems & Definitions (4)

  • Definition 1: layer consistency
  • Definition 2: entangled concepts
  • Definition 3: activation spatial dependence
  • Definition 4: concept vector spatial dependence