Concept activation vectors: a unifying view and adversarial attacks

Ekkehard Schnoor, Malik Tiomoko, Jawher Said, Alex Jung, Wojciech Samek

TL;DR

This work reframes Concept Activation Vectors (CAVs) and TCAV within a probabilistic framework, treating CAVs as random vectors induced by the distributions over concept and non-concept inputs. It unifies PatternCAV and FastCAV by expressing both in terms of class means and covariances, showing ${\mathbb{E}}[w_{pat}] = \mu_2 - \mu_1$ and ${\mathrm{Cov}}(w_{pat}) = \Sigma_1/n_1 + \Sigma_2/n_2$, with ${\mathbb{E}}[\bar{w}_{fast}] = (\mu_2 - \mu_1)/2$ and ${\mathrm{Cov}}(\bar{w}_{fast}) = \Sigma_1/(4n_1) + \Sigma_2/(4n_2)$ in the balanced case. It shows that PatternCAV and FastCAV behave similarly to ridge regression in the large-regularization limit and that their classification accuracy can be predicted from Gaussian-distributed projection scores, supported by experiments on synthetic data, CIFAR-10 with ResNet-18, and time series. The paper also reveals a vulnerability: TCAV scores depend on the non-concept distribution, which a latent-space adversarial attack exploits to manipulate TCAV explanations, underscoring the need for a robust, systematic study of concept-based explanations. Overall, the work provides a unified, theoretically grounded view of CAV variants and highlights practical implications for the reliability of XAI methods in deep learning.
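
The stated moments of $w_{pat}$ can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that $w_{pat}$ is the difference of the two class sample means (an assumption that is consistent with the stated mean and covariance), uses diagonal covariances so that each coordinate can be treated separately, and picks all parameter values (`mu1`, `sig1`, `n1`, and so on) purely for illustration.

```python
import random
import statistics


def sample_mean(rng, mu, sigma, n):
    """Coordinate-wise mean of n draws from N(mu, diag(sigma^2))."""
    return [
        statistics.fmean([rng.gauss(m, s) for _ in range(n)])
        for m, s in zip(mu, sigma)
    ]


def pattern_cav(rng, mu1, sig1, n1, mu2, sig2, n2):
    """ASSUMPTION: the CAV is the difference of the class sample means."""
    m1 = sample_mean(rng, mu1, sig1, n1)
    m2 = sample_mean(rng, mu2, sig2, n2)
    return [b - a for a, b in zip(m1, m2)]


def simulate(trials=2000, seed=0):
    """Compare empirical moments of the CAV with the stated formulas."""
    rng = random.Random(seed)
    # Hypothetical class parameters (per-coordinate means and standard deviations).
    mu1, sig1, n1 = [0.0, 1.0], [1.0, 2.0], 20
    mu2, sig2, n2 = [2.0, -1.0], [0.5, 1.0], 40
    draws = [pattern_cav(rng, mu1, sig1, n1, mu2, sig2, n2) for _ in range(trials)]
    dims = range(len(mu1))
    emp_mean = [statistics.fmean([w[j] for w in draws]) for j in dims]
    emp_var = [statistics.variance([w[j] for w in draws]) for j in dims]
    # Theory: E[w] = mu2 - mu1, Var[w_j] = sigma1_j^2 / n1 + sigma2_j^2 / n2.
    theory_mean = [b - a for a, b in zip(mu1, mu2)]
    theory_var = [s1 ** 2 / n1 + s2 ** 2 / n2 for s1, s2 in zip(sig1, sig2)]
    return emp_mean, emp_var, theory_mean, theory_var


if __name__ == "__main__":
    for name, values in zip(
        ("empirical mean", "empirical var", "theory mean", "theory var"), simulate()
    ):
        print(name, values)
```

With diagonal covariances, the per-coordinate variances are the diagonal of $\Sigma_1/n_1 + \Sigma_2/n_2$; a full-covariance check would need a multivariate sampler.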

Abstract

Concept Activation Vectors (CAVs) are a tool from explainable AI, offering a promising approach for understanding how human-understandable concepts are encoded in a model's latent spaces. They are computed from hidden-layer activations of inputs belonging either to a concept class or to non-concept examples. Adopting a probabilistic perspective, the distribution of the (non-)concept inputs induces a distribution over the CAV, making it a random vector in the latent space. This enables us to derive mean and covariance for different types of CAVs, leading to a unified theoretical view. This probabilistic perspective also reveals a potential vulnerability: CAVs can strongly depend on the rather arbitrary non-concept distribution, a factor largely overlooked in prior work. We illustrate this with a simple yet effective adversarial attack, underscoring the need for a more systematic study.
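
To make the dependence on the non-concept distribution concrete, here is a toy sketch. It is not the attack from the paper: it assumes a difference-of-means CAV, uses the standard TCAV score (the fraction of inputs whose gradient has a positive inner product with the CAV), and the gradients and class means are hypothetical numbers chosen for illustration.

```python
def difference_of_means_cav(mu_concept, mu_non_concept):
    """ASSUMPTION: CAV taken as concept mean minus non-concept mean."""
    return [c - n for c, n in zip(mu_concept, mu_non_concept)]


def tcav_score(gradients, cav):
    """Fraction of gradients with a positive inner product with the CAV."""
    positive = sum(
        1 for g in gradients if sum(gi * wi for gi, wi in zip(g, cav)) > 0
    )
    return positive / len(gradients)


# Hypothetical latent-space gradients of a class logit for three inputs.
gradients = [(1.0, 0.2), (0.8, -0.1), (1.2, 0.3)]
mu_concept = (1.0, 1.0)

# Same concept, two different choices of non-concept mean.
score_a = tcav_score(gradients, difference_of_means_cav(mu_concept, (0.0, 1.0)))
score_b = tcav_score(gradients, difference_of_means_cav(mu_concept, (2.0, 1.0)))
print(score_a, score_b)
```

In this toy setting the concept data and the model gradients are identical in both runs; only the non-concept mean changes, yet the CAV flips direction and the score moves from one extreme to the other.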

Paper Structure

This paper contains 6 sections, 2 theorems, 15 equations, 4 figures.

Key Result

Theorem 1

Let ${\mathbf{w}} \in \mathbb{R}^d$ be the (random) weight vector of the linear model $g$ in \ref{eq:general_linear_classifier_g}, with mean $\bar{{\mathbf{w}}} = \mathbb{E}[{\mathbf{w}}]$ and covariance ${\mathbf{\Sigma }}_{\mathbf{w}} = \mathrm{Cov}({\mathbf{w}})$. Assume $g({\mathbf{x}})$ to have class [...]. Consequently, the (optimal) classification accuracy of \ref{eq:general_linear_classifier_g} is given by [...]

Figures (4)

  • Figure 1: Illustration of the normal distributions of $g({\mathbf{x}}) = {\mathbf{x}}^\top {\mathbf{w}}$ for ${\mathbf{x}} \in \mathcal{C}_1$ (red) and ${\mathbf{x}} \in \mathcal{C}_2$ (blue), with the optimal decision threshold $\eta^\star$ at the intersection of the two density functions between the means.
  • Figure 2: Accurate theoretical prediction of the classification error $\varepsilon$ from \ref{eq:error_of_misclassification} as a function of $\lambda$ (ridge regression); different colors refer to different methods (\ref{fig:rr_pattern_fast_theory_emp_synthetic}) or different layers (\ref{fig:time_series_ridge_regression_lambda}).
  • Figure 3: Close match between test-set histograms and the Gaussian predictions of the CAV projection for the concept $\textit{blue}$ vs. noise ${\mathbf{x}} \sim \mathcal{N}(\mathbf{0}, {\mathbf{I}})$; ridge regression (fixed $\lambda$), for different layers (ResNet-18 model for the CIFAR-10 dataset).
  • Figure 4: Histograms of $S_{C,k,l} ({\mathbf{x}})$ from \ref{eq:TCAV_sensitivity_inner_product} with ${\mathbf{x}}$ from $3$ classes (colors) at layer layer3.0.conv2 of the ResNet-18 model for the CIFAR-10 dataset; using two different choices of $S = (s_1,s_2,s_3)$ to manipulate the CAV, and thus the histograms of the scores $S_{C,k,l} ({\mathbf{x}})$, as well as $\operatorname{TCAV}_{Q_{C,k,l}}$.
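
The threshold construction described in the caption of Figure 1 can be sketched as follows. This is an illustration under stated assumptions rather than the formula from Theorem 1: the projection scores of both classes are taken to be Gaussian with known means and standard deviations, the class priors are assumed equal, and the function names `optimal_threshold` and `threshold_accuracy` are hypothetical.

```python
import math
from statistics import NormalDist


def optimal_threshold(c1, c2):
    """Point between the two means where both Gaussian densities are equal."""
    if math.isclose(c1.stdev, c2.stdev):
        return 0.5 * (c1.mean + c2.mean)  # equal variances: midpoint
    v1, v2 = c1.variance, c2.variance
    # Equating the log-densities gives a*t^2 + b*t + c = 0.
    a = 1.0 / (2.0 * v1) - 1.0 / (2.0 * v2)
    b = c2.mean / v2 - c1.mean / v1
    c = (c1.mean ** 2 / (2.0 * v1) - c2.mean ** 2 / (2.0 * v2)
         + math.log(c1.stdev / c2.stdev))
    root = math.sqrt(b * b - 4.0 * a * c)
    lo, hi = sorted((c1.mean, c2.mean))
    for t in ((-b + root) / (2.0 * a), (-b - root) / (2.0 * a)):
        if lo <= t <= hi:
            return t
    raise ValueError("densities do not intersect between the means")


def threshold_accuracy(c1, c2, t):
    """Accuracy of thresholding at t, ASSUMING equal priors and c1.mean < c2.mean."""
    return 0.5 * (c1.cdf(t) + (1.0 - c2.cdf(t)))


if __name__ == "__main__":
    scores_1 = NormalDist(mu=0.0, sigma=1.0)  # hypothetical class C_1 scores
    scores_2 = NormalDist(mu=3.0, sigma=2.0)  # hypothetical class C_2 scores
    eta = optimal_threshold(scores_1, scores_2)
    print(eta, threshold_accuracy(scores_1, scores_2, eta))
```

When the two variances differ, the densities also intersect at a second point outside the interval between the means; the sketch deliberately keeps only the intersection between the means, matching the single threshold drawn in Figure 1.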

Theorems & Definitions (5)

  • Theorem 1
  • proof
  • Proposition 1
  • proof
  • Remark 1: Relationship between ${{\mathbf{w}}}_{{\operatorname{fast}}}$, ${{\mathbf{w}}}_{{\operatorname{pat}}}$ and ${{\mathbf{w}}}_{{\operatorname{ridge}}}$