Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Been Kim; Martin Wattenberg; Justin Gilmer; Carrie Cai; James Wexler; Fernanda Viegas; Rory Sayres

TL;DR

The paper addresses interpretability by moving beyond feature attribution to concept-based explanations using Concept Activation Vectors (CAVs). It introduces Testing with CAVs (TCAV), which uses directional derivatives to quantify a model’s sensitivity to user-defined concepts across classes, with statistical testing and a relative-CAV extension for fine-grained comparisons. Through experiments on standard image classifiers and a diabetic retinopathy task, it demonstrates global, human-aligned explanations and reveals biases that saliency methods may miss, supported by a controlled ground-truth study and human-subject evaluation. The work positions TCAV as a flexible, plug-in tool for post-hoc analysis that can be extended to other modalities and adversarial contexts, enabling more actionable model understanding and debugging.

Abstract

The interpretation of deep learning models is a challenge due to their size, complexity, and often opaque internal state. In addition, many systems, such as image classifiers, operate on low-level features rather than high-level concepts. To address these challenges, we introduce Concept Activation Vectors (CAVs), which provide an interpretation of a neural net's internal state in terms of human-friendly concepts. The key idea is to view the high-dimensional internal state of a neural net as an aid, not an obstacle. We show how to use CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result--for example, how sensitive a prediction of "zebra" is to the presence of stripes. Using the domain of image classification as a testing ground, we describe how CAVs may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.
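The abstract's core mechanism can be sketched in a few lines: fit a linear classifier that separates a layer's activations for concept examples from activations for random examples, take the normal of its decision boundary as the CAV, then measure how often the directional derivative of a class logit along that vector is positive. The snippet below is a toy illustration under stated assumptions, not the authors' implementation. The activations are synthetic, the "network head" is a hand-written function with an analytic gradient, and the names `train_cav`, `head_logit_grad`, and `tcav_score` are hypothetical. The statistical testing step is omitted.

```python
import numpy as np

def train_cav(concept_acts, random_acts, steps=500, lr=0.1):
    """Fit logistic regression (concept=1, random=0) by gradient descent
    and return the unit normal of the decision boundary as the CAV."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(concept)
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of mean log-loss
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)

def head_logit_grad(acts):
    """Gradient of a toy class logit h(a) = 2*tanh(a[0]) + 0.1*a[1]
    with respect to the layer activations a (assumed stand-in for a real
    network's backward pass from the logit to the chosen layer)."""
    g = np.zeros_like(acts)
    g[:, 0] = 2.0 * (1.0 - np.tanh(acts[:, 0]) ** 2)
    g[:, 1] = 0.1
    return g

def tcav_score(class_acts, cav):
    """Fraction of class examples whose logit increases along the CAV."""
    sensitivity = head_logit_grad(class_acts) @ cav  # directional derivatives
    return float(np.mean(sensitivity > 0))

# Synthetic layer activations: the concept shifts dimension 0.
rng = np.random.default_rng(0)
d = 8
concept_acts = rng.normal(size=(200, d))
concept_acts[:, 0] += 3.0
random_acts = rng.normal(size=(200, d))
class_acts = rng.normal(size=(100, d))

cav = train_cav(concept_acts, random_acts)
score = tcav_score(class_acts, cav)
print("CAV weight on concept dimension:", round(float(cav[0]), 3))
print("TCAV score:", score)
```

Because the toy logit depends mostly on the same dimension the concept shifts, the score should come out close to 1. With a real model, the analytic gradient would be replaced by the framework's automatic differentiation at the chosen layer.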
