Explaining Explainability: Recommendations for Effective Use of Concept Activation Vectors
Angus Nicolson, Lisa Schut, J. Alison Noble, Yarin Gal
TL;DR
This paper investigates three properties of CAVs that can distort concept-based explanations: inconsistency across layers, entanglement with other concepts, and spatial dependence. It formalises hypotheses, develops tools to detect each property, and demonstrates their impact through experiments on Elements, ImageNet, and a melanoma classification use case (ISIC 2019). The synthetic Elements dataset enables controlled study of ground-truth concept–class relationships and concept entanglement, while spatially dependent CAVs make it possible to test whether a model is translation invariant with respect to a specific concept and class. The authors give concrete recommendations for practitioners (use multiple layers, verify concept dependencies, and visualise spatial dependence) and release Elements to facilitate further research on interpretability methods. Overall, the work clarifies when CAVs produce reliable explanations and how to diagnose and exploit their properties for deeper model understanding.
Abstract
Concept-based explanations translate the internal representations of deep learning models into a language that humans are familiar with: concepts. One popular method for finding concepts is Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars. In this work, we investigate three properties of CAVs: (1) inconsistency across layers, (2) entanglement with other concepts, and (3) spatial dependency. Each property provides both challenges and opportunities in interpreting models. We introduce tools designed to detect the presence of these properties, provide insight into how each property can lead to misleading explanations, and provide recommendations to mitigate their impact. To demonstrate practical applications, we apply our recommendations to a melanoma classification task, showing how entanglement can lead to uninterpretable results and that the choice of negative probe set can have a substantial impact on the meaning of a CAV. Further, we show that understanding these properties can be used to our advantage. For example, we introduce spatially dependent CAVs to test if a model is translation invariant with respect to a specific concept and class. Our experiments are performed on natural images (ImageNet), skin lesions (ISIC 2019), and a new synthetic dataset, Elements. Elements is designed to capture a known ground truth relationship between concepts and classes. We release this dataset to facilitate further research in understanding and evaluating interpretability methods.
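As a rough illustration of how a CAV can be learnt from a probe dataset, the sketch below fits a linear probe that separates activations of concept exemplars from activations of a negative probe set and takes the normalised weight vector as the CAV. This follows the common formulation of a CAV as the normal to a linear decision boundary; it is not the authors' code. The function name `train_cav`, the helper `sample_activations`, the hyperparameters, and the synthetic 4-dimensional "activations" are all assumptions made for illustration. In practice the activations would be extracted from a chosen layer of the model under study.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_cav(concept_acts, negative_acts, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with batch gradient descent and
    return its unit-norm weight vector (hypothetical helper)."""
    dim = len(concept_acts[0])
    w, b = [0.0] * dim, 0.0
    # Label concept exemplars 1 and negative probe examples 0.
    data = [(x, 1.0) for x in concept_acts] + [(x, 0.0) for x in negative_acts]
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


# Synthetic stand-in for layer activations (assumption): the concept
# shifts the first coordinate, the other coordinates are pure noise.
rng = random.Random(0)


def sample_activations(shift, n, dim=4):
    return [[rng.gauss(shift if i == 0 else 0.0, 1.0) for i in range(dim)]
            for _ in range(n)]


concept_acts = sample_activations(2.0, 100)
negative_acts = sample_activations(0.0, 100)
cav = train_cav(concept_acts, negative_acts)
print([round(c, 3) for c in cav])
```

Because the CAV is defined relative to the negative examples, swapping the negative probe set changes the direction that is learnt, which is one way the choice of negative set can alter what a CAV means.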
