Probing the Probes: Methods and Metrics for Concept Alignment
Jacob Lysnæs-Larsen, Marte Eggen, Inga Strümke
TL;DR
The paper examines a central problem in explainable AI: probe accuracy does not reliably measure concept alignment in CAVs, since probes can rely on spurious correlations. It demonstrates this with FP-CAVs that achieve high accuracy without target-concept data, underscoring the need for alignment-aware evaluation. To address this, it introduces Pattern-CAVs, Segmentation-CAVs, and Combination-CAVs, along with translation-invariant variants, visualization tools (activation maximization, CLMs), and alignment metrics (normal versus hard accuracy, segmentation score, augmentation robustness). Across extensive experiments, translation-invariant and segmentation-based methods generally yield stronger concept alignment and robustness, highlighting the practical importance of alignment-aware probing for trustworthy concept-based explanations.
Abstract
In explainable AI, Concept Activation Vectors (CAVs) are typically obtained by training linear classifier probes to detect human-understandable concepts as directions in the activation space of deep neural networks. It is widely assumed that high probe accuracy indicates a CAV that faithfully represents its target concept. However, we show that the probe's classification accuracy alone is an unreliable measure of concept alignment, i.e., the degree to which a CAV captures the intended concept. In fact, we argue that probes are more likely to capture spurious correlations than to represent only the intended concept. As part of our analysis, we demonstrate that deliberately misaligned probes, constructed to exploit spurious correlations, achieve an accuracy close to that of standard probes. To address this severe problem, we introduce a novel concept localization method based on spatial linear attribution, and provide a comprehensive comparison of it to existing feature visualization techniques for detecting and mitigating concept misalignment. We further propose three classes of metrics for quantitatively assessing concept alignment: hard accuracy, segmentation scores, and augmentation robustness. Our analysis shows that probes with translation invariance and spatial alignment consistently achieve higher concept alignment. These findings highlight the need for alignment-based evaluation metrics rather than probe accuracy, and the importance of tailoring probes to both the model architecture and the nature of the target concept.
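The gap between probe accuracy and concept alignment can be illustrated with a toy sketch. This is not the paper's experimental setup. It assumes synthetic 16-dimensional "activations" in which one axis carries a weak but genuine concept signal and another carries a strong cue that merely co-occurs with the concept in the training data. The names `make_activations`, `train_probe`, and the "hard" split are hypothetical, and "hard" here means only that the cue is sampled independently of the label, which may differ from the paper's definition of hard accuracy. The probe is a plain logistic regression fitted by gradient descent, and the CAV is its normalized weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16                   # size of the toy activation vectors (assumption)
CONCEPT, SPURIOUS = 0, 1   # axes carrying the true concept and the spurious cue


def make_activations(n, spurious_follows_label):
    """Sample toy activations x and binary concept labels y."""
    y = rng.integers(0, 2, size=n)
    sign = 2 * y - 1
    # In the "hard" split the spurious cue is drawn independently of the label.
    cue = sign if spurious_follows_label else rng.choice([-1, 1], size=n)
    x = rng.normal(size=(n, DIM))
    x[:, CONCEPT] += 0.5 * sign   # weak but genuine concept signal
    x[:, SPURIOUS] += 2.0 * cue   # strong cue that only co-occurs with the concept
    return x, y


def train_probe(x, y, lr=0.1, steps=2000):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        err = p - y
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def accuracy(w, b, x, y):
    """Fraction of samples whose thresholded probe output matches the label."""
    preds = (x @ w + b > 0).astype(int)
    return float(np.mean(preds == y))


x_train, y_train = make_activations(4000, spurious_follows_label=True)
x_normal, y_normal = make_activations(2000, spurious_follows_label=True)
x_hard, y_hard = make_activations(2000, spurious_follows_label=False)

w, b = train_probe(x_train, y_train)
cav = w / np.linalg.norm(w)   # the CAV is the unit-norm probe direction

normal_acc = accuracy(w, b, x_normal, y_normal)
hard_acc = accuracy(w, b, x_hard, y_hard)

print(f"normal accuracy: {normal_acc:.3f}")
print(f"hard accuracy:   {hard_acc:.3f}")
print(f"CAV component on concept axis:  {cav[CONCEPT]:.3f}")
print(f"CAV component on spurious axis: {cav[SPURIOUS]:.3f}")
```

In this toy setting the probe scores well on data drawn like its training set, yet its direction is dominated by the spurious axis, so accuracy drops once the cue stops tracking the label. That is the kind of failure a single accuracy number cannot reveal.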
