Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, Rory Sayres

TL;DR

The paper addresses interpretability by moving beyond feature attribution to concept-based explanations using Concept Activation Vectors (CAVs). It introduces Testing with CAVs (TCAV), which uses directional derivatives to quantify a model’s sensitivity to user-defined concepts across classes, with statistical testing and a relative-CAV extension for fine-grained comparisons. Through experiments on standard image classifiers and a diabetic retinopathy task, it demonstrates global, human-aligned explanations and reveals biases that saliency methods may miss, supported by a controlled ground-truth study and human-subject evaluation. The work positions TCAV as a flexible, plug-in tool for post-hoc analysis that can be extended to other modalities and adversarial contexts, enabling more actionable model understanding and debugging.

Abstract

The interpretation of deep learning models is a challenge due to their size, complexity, and often opaque internal state. In addition, many systems, such as image classifiers, operate on low-level features rather than high-level concepts. To address these challenges, we introduce Concept Activation Vectors (CAVs), which provide an interpretation of a neural net's internal state in terms of human-friendly concepts. The key idea is to view the high-dimensional internal state of a neural net as an aid, not an obstacle. We show how to use CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result--for example, how sensitive a prediction of "zebra" is to the presence of stripes. Using the domain of image classification as a testing ground, we describe how CAVs may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.
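As a rough illustration of the procedure the abstract describes, the sketch below learns a CAV and computes a TCAV-style score on toy data. Everything in it is an assumption made for illustration: the 3-dimensional "activations", the hand-written logistic regression standing in for the linear classifier, the finite-difference estimate used in place of automatic differentiation, and all names (`train_cav`, `conceptual_sensitivity`, `tcav_score`, `uses_concept`, `opposes_concept`). It is not the authors' implementation.

```python
import math
import random


def train_cav(concept_acts, random_acts, steps=200, lr=0.1):
    """Fit logistic regression (concept vs. random) and return the unit normal of its boundary."""
    dim = len(concept_acts[0])
    data = [(a, 1.0) for a in concept_acts] + [(a, 0.0) for a in random_acts]
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for a, y in data:
            z = sum(wi * ai for wi, ai in zip(w, a)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # derivative of the log-loss w.r.t. z
            grad_w = [g + err * ai for g, ai in zip(grad_w, a)]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def conceptual_sensitivity(logit_fn, act, cav, eps=1e-4):
    """Finite-difference estimate of the directional derivative of logit_fn at act along cav."""
    shifted = [ai + eps * vi for ai, vi in zip(act, cav)]
    return (logit_fn(shifted) - logit_fn(act)) / eps


def tcav_score(logit_fn, class_acts, cav):
    """Fraction of class examples whose logit increases when moving along the CAV."""
    positive = sum(conceptual_sensitivity(logit_fn, a, cav) > 0 for a in class_acts)
    return positive / len(class_acts)


rng = random.Random(0)


def noisy(center, n):
    """Toy stand-in for layer activations: Gaussian noise around a center."""
    return [[c + rng.gauss(0.0, 0.5) for c in center] for _ in range(n)]


concept_acts = noisy([2.0, 0.0, 0.0], 50)  # hypothetical activations of concept examples
random_acts = noisy([0.0, 0.0, 0.0], 50)   # hypothetical activations of random examples
class_acts = noisy([1.0, 0.0, 0.0], 40)    # hypothetical activations of class examples


def softplus(x):
    return math.log1p(math.exp(x))


def uses_concept(act):
    """Toy class logit that rises along the concept direction."""
    return softplus(act[0])


def opposes_concept(act):
    """Toy class logit that falls along the concept direction."""
    return -softplus(act[0])


cav = train_cav(concept_acts, random_acts)
score_uses = tcav_score(uses_concept, class_acts, cav)
score_opposes = tcav_score(opposes_concept, class_acts, cav)
print("CAV:", cav)
print("TCAV score (logit rises with concept):", score_uses)
print("TCAV score (logit falls with concept):", score_opposes)
```

With a real network, `class_acts` would be the layer activations of actual class images and the directional derivative would normally come from the framework's automatic differentiation rather than finite differences.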

Paper Structure

This paper contains 27 sections, 3 equations, 19 figures.

Figures (19)

  • Figure 1: Testing with Concept Activation Vectors: Given a user-defined set of examples for a concept (e.g., 'striped') and random examples ⓐ, labeled training-data examples for the studied class (zebras) ⓑ, and a trained network ⓒ, TCAV can quantify the model's sensitivity to the concept for that class. CAVs are learned by training a linear classifier to distinguish between the activations that a concept's examples and random examples produce at any layer ⓓ. The CAV is the vector orthogonal to the classification boundary ($v_C^l$, red arrow). For the class of interest (zebras), TCAV uses the directional derivative $S_{C, k, l}(\bm{x})$ to quantify conceptual sensitivity ⓔ.
  • Figure 2: The most and least similar pictures of stripes using the 'CEO' concept (left) and of neckties using the 'model women' concept (right).
  • Figure 3: Empirical Deepdream using knitted texture, corgi, and Siberian husky concept vectors (zoomed in).
  • Figure 4: Relative TCAV for all layers in GoogleNet (Szegedy et al., 2015) and the last three layers in Inception V3 (Szegedy et al., 2016) for confirmation (e.g., fire engine), discovering biases (e.g., rugby, apron), and quantitative confirmation of previously qualitative findings in (Mordvintsev et al., 2015; Stock and Cisse, 2017) (e.g., dumbbell, ping-pong ball). TCAVqs in layers close to the logit layer (red) represent more direct influence on the prediction than lower layers. '*'s mark CAVs omitted after statistical testing.
  • Figure 5: The accuracies of CAVs at each layer. Simple concepts (e.g., colors) achieve higher performance in lower layers than more abstract or complex concepts (e.g., people, objects).
  • ...and 14 more figures
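
For reference, the symbols used in the Figure 1 and Figure 4 captions can be written out as follows. The notation beyond $v_C^l$ and $S_{C,k,l}(\bm{x})$ is an assumption about naming, since this page does not define it: $f_l(\bm{x})$ is taken to be the layer-$l$ activations of input $\bm{x}$, $h_{l,k}$ the map from layer-$l$ activations to the logit of class $k$, and $X_k$ the set of inputs labeled with class $k$.

$$
S_{C,k,l}(\bm{x}) = \lim_{\epsilon \to 0} \frac{h_{l,k}\big(f_l(\bm{x}) + \epsilon\, v_C^l\big) - h_{l,k}\big(f_l(\bm{x})\big)}{\epsilon} = \nabla h_{l,k}\big(f_l(\bm{x})\big) \cdot v_C^l
$$

$$
\mathrm{TCAV}_{Q_{C,k,l}} = \frac{\big|\{\bm{x} \in X_k : S_{C,k,l}(\bm{x}) > 0\}\big|}{|X_k|}
$$

Under this reading, the score is the fraction of class-$k$ inputs whose class logit increases when the layer-$l$ activations move a small step in the concept direction, so it depends only on the sign of $S_{C,k,l}$ and not on its magnitude.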