Measuring the (Un)Faithfulness of Concept-Based Explanations
Shubham Kumar, Narendra Ahuja
TL;DR
This work critiques how the faithfulness of unsupervised concept-based explanation methods (U-CBEMs) has been measured, arguing that prior evaluations overstate faithfulness: complex surrogates add an unmeasured cost to interpretability, and deletion-based proxies do not properly measure faithfulness. It introduces SURF, a simple linear surrogate that uses concept activations and importances to predict model outputs, paired with two metrics (SURF_MAE and SURF_EMD) that gauge faithfulness across all output classes rather than only the predicted class. A measure-over-measure sanity check shows that only SURF reports lower faithfulness as concepts are increasingly randomized, which enables a fair benchmark of six U-CBEMs on multiple tasks; the results reveal that many state-of-the-art U-CBEMs are not faithful. SURF also provides a principled way to choose the number of concepts, exposing the trade-off between interpretability and faithfulness. The authors suggest adopting SURF as a standard faithfulness benchmark for future work and plan to release the code.
Abstract
Deep vision models perform input-output computations that are hard to interpret. Concept-based explanation methods (CBEMs) increase interpretability by re-expressing parts of the model with human-understandable semantic units, or concepts. Checking whether the derived explanations are faithful -- that is, whether they represent the model's internal computation -- requires a surrogate that combines concepts to compute the output. Simplifications made for interpretability inevitably reduce faithfulness, resulting in a tradeoff between the two. State-of-the-art unsupervised CBEMs (U-CBEMs) have reported increasingly interpretable concepts that are also more faithful to the model. However, we observe that the reported improvement in faithfulness is an artifact of either (1) using overly complex surrogates, which introduces an unmeasured cost to the explanation's interpretability, or (2) relying on deletion-based approaches that, as we demonstrate, do not properly measure faithfulness. We propose Surrogate Faithfulness (SURF), which (1) replaces prior complex surrogates with a simple, linear surrogate that measures faithfulness without changing the explanation's interpretability and (2) introduces well-motivated metrics that assess loss across all output classes, not just the predicted class. We validate SURF with a measure-over-measure study by proposing a simple sanity check -- explanations with random concepts should be less faithful -- which prior surrogates fail. SURF enables the first reliable faithfulness benchmark of U-CBEMs, revealing that many visually compelling U-CBEMs are not faithful. Code to be released.
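To make the idea of a linear surrogate and the random-concept sanity check concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact definitions of the SURF metrics, so the function names, the softmax on the surrogate output, the all-class mean absolute error, and the 1-D earth mover's distance over class indices are all illustrative assumptions. The toy "model" is linear in the concept activations by construction, so the unrandomized surrogate matches it exactly.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def linear_surrogate(activations, importances):
    # Hypothetical surrogate: class scores are a linear combination of
    # concept activations (n_samples, n_concepts) weighted by concept
    # importances (n_concepts, n_classes). The softmax is an assumption.
    return softmax(activations @ importances)

def mae_all_classes(p_model, p_surr):
    # Assumed MAE-style metric: mean absolute difference between model and
    # surrogate outputs over every class, not only the predicted class.
    return float(np.abs(p_model - p_surr).mean())

def emd_all_classes(p_model, p_surr):
    # Assumed EMD-style metric: 1-D earth mover's distance with unit spacing
    # between class indices, averaged over samples. The paper may define its
    # ground distance differently.
    cdf_gap = np.cumsum(p_model - p_surr, axis=1)
    return float(np.abs(cdf_gap).sum(axis=1).mean())

def randomize_concepts(activations, fraction, rng):
    # Replace a fraction of concept columns with Gaussian noise, standing in
    # for "random concepts" in the sanity check.
    out = activations.copy()
    n_samples, n_concepts = activations.shape
    n_random = int(round(fraction * n_concepts))
    if n_random == 0:
        return out
    cols = rng.choice(n_concepts, size=n_random, replace=False)
    out[:, cols] = rng.normal(size=(n_samples, n_random))
    return out

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8))   # toy concept activations
W = rng.normal(size=(8, 5))     # toy concept importances
p_model = softmax(A @ W)        # toy model outputs, linear in concepts

for fraction in (0.0, 0.5, 1.0):
    p_surr = linear_surrogate(randomize_concepts(A, fraction, rng), W)
    print(fraction, mae_all_classes(p_model, p_surr),
          emd_all_classes(p_model, p_surr))
```

A faithfulness measure that passes the sanity check should report a larger error once concepts are randomized than it does for the original concepts; in this toy setup the error at fraction 0.0 is zero only because the model was built to be linear in the concepts.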
