Concept activation vectors: a unifying view and adversarial attacks
Ekkehard Schnoor, Malik Tiomoko, Jawher Said, Alex Jung, Wojciech Samek
TL;DR
This work reframes Concept Activation Vectors (CAVs) and TCAV within a probabilistic framework, treating CAVs as random vectors induced by distributions over concept and non-concept inputs. It derives a unifying theory by expressing PatternCAV and FastCAV in terms of class means and covariances, showing ${\mathbb{E}}[w_{pat}] = \mu_2 - \mu_1$ and ${\mathrm{Cov}}(w_{pat}) = \Sigma_1/n_1 + \Sigma_2/n_2$, with ${\mathbb{E}}[\bar{w}_{fast}] = (\mu_2 - \mu_1)/2$ and ${\mathrm{Cov}}(\bar{w}_{fast}) = \Sigma_1/(4n_1) + \Sigma_2/(4n_2)$ in the balanced case. It shows that PatternCAV and FastCAV behave similarly to ridge regression in the large-regularization limit, and that their classification accuracy can be predicted from Gaussian-distributed projection scores; both findings are supported by synthetic, CIFAR-10/ResNet-18, and time-series experiments. The paper also reveals a vulnerability: TCAV scores depend on the non-concept distribution. It introduces a latent-space adversarial attack that manipulates TCAV explanations, underscoring the need for a robust, systematic study of concept-based explanations. Overall, the work provides a unified, theoretically grounded view of CAV variants and highlights practical implications for the reliability of XAI methods in deep learning.
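The stated mean and covariance of $w_{pat}$ can be checked numerically. The sketch below is an illustration, not the authors' code. It assumes the CAV is the difference of the two class sample means (which is consistent with the moments above but not spelled out here), that class 2 is the concept class, and that activations are Gaussian. The function name `mean_difference_cav` and all parameter values are hypothetical.

```python
import numpy as np

def mean_difference_cav(concept_acts, non_concept_acts):
    """Difference of class sample means (assumed form of the CAV)."""
    return concept_acts.mean(axis=-2) - non_concept_acts.mean(axis=-2)

rng = np.random.default_rng(0)
d, n1, n2, trials = 3, 40, 60, 2000

# Hypothetical class parameters: class 1 = non-concept, class 2 = concept.
mu1 = np.zeros(d)
mu2 = np.array([1.0, -0.5, 2.0])
Sigma1 = np.diag([1.0, 2.0, 0.5])
Sigma2 = np.array([[1.0, 0.3, 0.0],
                   [0.3, 1.5, 0.2],
                   [0.0, 0.2, 0.8]])

# Draw `trials` independent datasets; arrays have shape (trials, n_k, d).
x1 = rng.multivariate_normal(mu1, Sigma1, size=(trials, n1))
x2 = rng.multivariate_normal(mu2, Sigma2, size=(trials, n2))

# One CAV per dataset, shape (trials, d).
cavs = mean_difference_cav(x2, x1)

empirical_mean = cavs.mean(axis=0)
empirical_cov = np.cov(cavs, rowvar=False)

predicted_mean = mu2 - mu1
predicted_cov = Sigma1 / n1 + Sigma2 / n2

print(np.abs(empirical_mean - predicted_mean).max())
print(np.abs(empirical_cov - predicted_cov).max())
```

Both printed deviations should be small relative to the predicted values. Under the same assumption, scaling the vector by 1/2 halves the mean and quarters the covariance, which matches the form of the $\bar{w}_{fast}$ expressions.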
Abstract
Concept Activation Vectors (CAVs) are a tool from explainable AI, offering a promising approach for understanding how human-understandable concepts are encoded in a model's latent spaces. They are computed from hidden-layer activations of inputs belonging either to a concept class or to non-concept examples. From a probabilistic perspective, the distribution of the (non-)concept inputs induces a distribution over the CAV, making it a random vector in the latent space. This enables us to derive the mean and covariance for different types of CAVs, leading to a unified theoretical view. This probabilistic perspective also reveals a potential vulnerability: CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work. We illustrate this with a simple yet effective adversarial attack, underscoring the need for a more systematic study.
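A toy example shows how much the choice of non-concept data can matter. This is a minimal sketch and not the attack from the paper. It assumes a mean-difference CAV, uses the standard TCAV score (the fraction of inputs whose directional derivative along the CAV is positive), and replaces real model gradients with synthetic vectors. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_difference_cav(concept_acts, non_concept_acts):
    """Difference of class sample means (assumed form of the CAV)."""
    return concept_acts.mean(axis=0) - non_concept_acts.mean(axis=0)

def tcav_score(gradients, cav):
    """Fraction of gradients with a positive projection onto the CAV."""
    return float(np.mean(gradients @ cav > 0))

n = 200
# Concept activations in a 2-D latent space, centred at (1, 1).
concept = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(n, 2))

# Two equally plausible non-concept sets on opposite sides of the concept.
non_concept_a = rng.normal(loc=[0.0, 1.0], scale=0.5, size=(n, 2))
non_concept_b = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(n, 2))

# Synthetic stand-in for logit gradients w.r.t. the latent activations,
# concentrated around the first coordinate axis.
gradients = np.array([1.0, 0.0]) + 0.1 * rng.normal(size=(500, 2))

cav_a = mean_difference_cav(concept, non_concept_a)
cav_b = mean_difference_cav(concept, non_concept_b)

score_a = tcav_score(gradients, cav_a)
score_b = tcav_score(gradients, cav_b)
print(score_a, score_b)
```

The concept data and the gradients are identical in both runs. Only the non-concept set changes, yet the CAV flips direction and the score moves from one extreme to the other.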
