Probing the Probes: Methods and Metrics for Concept Alignment

Jacob Lysnæs-Larsen, Marte Eggen, Inga Strümke

TL;DR

The paper examines a central problem in explainable AI: probe accuracy does not reliably measure the concept alignment of CAVs, because probes can rely on spurious correlations. It demonstrates this with FP-CAVs, which achieve high accuracy without any target-concept data, underscoring the need for alignment-aware evaluation. To address this, it introduces Pattern-CAVs, Segmentation-CAVs, and Combination-CAVs, along with translation-invariant variants, visualization tools (activation maximization, CLMs), and alignment metrics (normal vs. hard accuracy, segmentation score, augmentation robustness). Across the experiments, translation-invariant and segmentation-based methods generally yield stronger concept alignment and robustness, which shows the practical value of alignment-aware probing for trustworthy concept-based explanations.

Abstract

In explainable AI, Concept Activation Vectors (CAVs) are typically obtained by training linear classifier probes to detect human-understandable concepts as directions in the activation space of deep neural networks. It is widely assumed that high probe accuracy indicates that a CAV faithfully represents its target concept. However, we show that the probe's classification accuracy alone is an unreliable measure of concept alignment, i.e., the degree to which a CAV captures the intended concept. In fact, we argue that probes are more likely to capture spurious correlations than they are to represent only the intended concept. As part of our analysis, we demonstrate that deliberately misaligned probes, constructed to exploit spurious correlations, achieve an accuracy close to that of standard probes. To address this severe problem, we introduce a novel concept localization method based on spatial linear attribution, and provide a comprehensive comparison of it to existing feature visualization techniques for detecting and mitigating concept misalignment. We further propose three classes of metrics for quantitatively assessing concept alignment: hard accuracy, segmentation scores, and augmentation robustness. Our analysis shows that probes with translation invariance and spatial alignment consistently increase concept alignment. These findings highlight the need for alignment-based evaluation metrics rather than probe accuracy, and the importance of tailoring probes to both the model architecture and the nature of the target concept.
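To make the starting point of the abstract concrete, the following is a minimal sketch of how a CAV can be read off a linear classifier probe. It is not the authors' code: the activations are synthetic, the planted direction `concept_dir`, the helper names, and all hyperparameters are assumptions chosen for illustration, and a plain gradient-descent logistic regression stands in for whatever probe training the paper uses.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.5, steps=1000, l2=1e-3):
    """Fit a logistic-regression probe by gradient descent; return (w, b)."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels                  # gradient of the log-loss w.r.t. the logits
        w -= lr * (acts.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of samples on the correct side of the probe's hyperplane."""
    preds = (acts @ w + b) > 0
    return float((preds == (labels > 0.5)).mean())

# Synthetic stand-in for layer activations (assumption): the concept shifts
# activations along one planted direction, all other dimensions are noise.
rng = np.random.default_rng(0)
d = 16
concept_dir = np.zeros(d)
concept_dir[0] = 1.0
labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, d)) + 3.0 * labels[:, None] * concept_dir

# Train on the first 300 samples, evaluate on the remaining 100.
w, b = train_linear_probe(acts[:300], labels[:300])
cav = w / np.linalg.norm(w)                   # the CAV is the normalized probe weight vector
acc = probe_accuracy(w, b, acts[300:], labels[300:])
print(f"probe accuracy: {acc:.2f}, cosine(CAV, planted direction): {cav @ concept_dir:.2f}")
```

In this toy setting the planted direction is known, so alignment can be checked directly with a cosine similarity. With real networks no such ground truth exists, which is why accuracy is so often used as a proxy and why the abstract argues that it is an unreliable one.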

Paper Structure

This paper contains 29 sections, 22 equations, 17 figures, 2 tables, 1 algorithm.

Figures (17)

  • Figure 1: Illustration of the procedure used to create FP-CAVs, which demonstrates the prevalence of concept misalignment. We first train a standard classifier probe on the concept horse and collect a set of false positives. We then train a second classifier on those false positives, using incorrect target labels. When both classifiers are tested on the same held-out dataset, they achieve similar classification accuracy. Additionally, the two corresponding CAVs have a high cosine similarity. This shows that classifier probes learn significant amounts of spurious correlations and that accuracy is unreliable for measuring concept alignment.
  • Figure 2: Comparison between Classifier- and FP-CAVs trained on the same concepts, with each sample corresponding to a concept. The results show similar classification accuracies and high cosine similarities between Classifier- and FP-CAVs.
  • Figure 3: Activation maximization visualizations for Classifier- (left), FP- (middle), and curated CAV (right) for the concept car.
  • Figure 4: Top-$k$ prototypical examples for the concept car using Classifier- (top row) and Segmentation-CAVs (bottom row). Classifier-CAVs activate for broader scene elements (e.g., roads, vegetation, buildings), while Segmentation-CAVs represent object-specific features (e.g., shiny metal parts).
  • Figure 5: Synthetic images generated via activation maximization for the six concepts pool table, building, boat, car, cat, and horse, as columns, comparing Classifier- (top row) and Segmentation-CAVs (bottom row).
  • ...and 12 more figures
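
The FP-CAV procedure described in the Figure 1 caption can be sketched on toy data as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the activations are synthetic (dimension 0 carries the concept, dimension 1 a co-occurring context cue), the second probe contrasts false positives with an equally sized random sample of true negatives (the caption does not specify the negative set), and every function name and hyperparameter is hypothetical.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.5, steps=1000, l2=1e-3):
    """Fit a logistic-regression probe by gradient descent; return (w, b)."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels
        w -= lr * (acts.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b

def predict(w, b, acts):
    return (acts @ w + b) > 0

def make_data(rng, n, d=8):
    """Toy activations (assumption): dim 0 = concept, dim 1 = correlated context cue."""
    labels = rng.integers(0, 2, size=n).astype(float)
    acts = rng.normal(size=(n, d))
    acts[:, 0] += 2.0 * labels
    acts[:, 1] += 2.0 * labels
    return acts, labels

rng = np.random.default_rng(0)
train_acts, train_labels = make_data(rng, 4000)
test_acts, test_labels = make_data(rng, 1000)

# Step 1: train a standard classifier probe on the concept.
w_std, b_std = train_linear_probe(train_acts, train_labels)

# Step 2: collect false positives, i.e. negatives the probe predicts as positive.
is_neg = train_labels == 0
pred_pos = predict(w_std, b_std, train_acts)
fp_idx = np.flatnonzero(is_neg & pred_pos)
tn_idx = np.flatnonzero(is_neg & ~pred_pos)

# Step 3: train a second probe that sees no concept-positive data at all.
# False positives get the (incorrect) label 1, sampled true negatives the label 0.
tn_sample = rng.choice(tn_idx, size=len(fp_idx), replace=False)
fp_acts = np.concatenate([train_acts[fp_idx], train_acts[tn_sample]])
fp_labels = np.concatenate([np.ones(len(fp_idx)), np.zeros(len(tn_sample))])
w_fp, b_fp = train_linear_probe(fp_acts, fp_labels)

# Step 4: compare both probes on the same held-out data with the true labels.
true_pos = test_labels == 1
acc_std = float((predict(w_std, b_std, test_acts) == true_pos).mean())
acc_fp = float((predict(w_fp, b_fp, test_acts) == true_pos).mean())
cos = float(w_std @ w_fp / (np.linalg.norm(w_std) * np.linalg.norm(w_fp)))
print(f"standard probe acc: {acc_std:.2f}, FP probe acc: {acc_fp:.2f}, cosine: {cos:.2f}")
```

One way to read the toy result: the false positives and true negatives are separated by the first probe's own decision boundary, so a second probe trained to tell them apart tends to recover a similar hyperplane even though it never sees a genuine positive example. Whether this intuition fully explains the effect reported for real networks is not something this sketch can establish.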