The Generalizability of Explanations
Hanxiao Tan
TL;DR
This work tackles the problem of objectively evaluating post-hoc explanations without ground truth by introducing a generalizability-based framework. It uses an autoencoder to learn the distribution of explanations generated by a given method and assesses two properties: learnability (how well the explanations can be reconstructed) and distribution proximity (how closely the reconstructed explanations resemble the original data distribution). The approach enables quantitative comparisons across gradient-based and perturbation-based explainability methods and reveals that perturbation-based methods, as well as SmoothGrad-enhanced variants, tend to yield more generalizable explanation distributions. The findings offer a practical metric for selecting and refining explainability techniques, with potential implications for trustworthy AI deployment, particularly in high-stakes vision applications.
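The learnability idea can be illustrated with a toy sketch. This is not the paper's pipeline: it swaps the autoencoder for the optimal linear autoencoder (equivalent to PCA, computed via SVD), uses synthetic arrays in place of real explanations, and covers only reconstruction error, not distribution proximity. The function name `linear_autoencoder_error` and both synthetic datasets are hypothetical.

```python
import numpy as np

def linear_autoencoder_error(explanations, code_dim):
    """Held-out reconstruction MSE of a linear autoencoder (PCA stand-in)."""
    X = explanations.reshape(len(explanations), -1)
    split = len(X) // 2
    train, test = X[:split], X[split:]
    mean = train.mean(axis=0)
    # The top right-singular vectors span the best code_dim-dimensional linear code.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:code_dim]                 # shape: (code_dim, n_features)
    codes = (test - mean) @ basis.T       # encode held-out explanations
    recon = codes @ basis + mean          # decode back to explanation space
    return float(np.mean((test - recon) ** 2))

rng = np.random.default_rng(0)
n, side, k = 200, 8, 4

# Assumption: "structured" explanations are mixtures of a few shared patterns.
patterns = rng.normal(size=(k, side * side))
structured = (rng.normal(size=(n, k)) @ patterns).reshape(n, side, side)

# Assumption: "unstructured" explanations are independent noise of the same scale.
unstructured = rng.normal(size=(n, side, side)) * structured.std()

err_structured = linear_autoencoder_error(structured, code_dim=k)
err_unstructured = linear_autoencoder_error(unstructured, code_dim=k)
print(f"structured:   {err_structured:.3e}")
print(f"unstructured: {err_unstructured:.3e}")
```

In this toy setting, explanations that share a small set of patterns are reconstructed far more accurately on held-out samples than noise-like ones, which is the intuition behind treating reconstruction quality as a proxy for learnability.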
Abstract
Because explanations lack ground truth, the objective evaluation of explainability methods is an essential research direction. So far, the vast majority of evaluations fall into three categories: human evaluation, sensitivity testing, and sanity checks. This work proposes a novel evaluation methodology from the perspective of generalizability. We employ an autoencoder to learn the distributions of the generated explanations and observe their learnability as well as the plausibility of the learned distributional features. We first briefly demonstrate the idea behind the proposed approach on LIME, and then quantitatively evaluate multiple popular explainability methods. We also find that smoothing the explanations with SmoothGrad can significantly enhance their generalizability.
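For readers unfamiliar with SmoothGrad, the smoothing step averages gradient-based attributions over noisy copies of the input. The sketch below is a minimal illustration under stated assumptions: `toy_grad` is a hypothetical analytic gradient standing in for a real model's input gradient, and the noise scale is given as an absolute standard deviation rather than relative to the input range.

```python
import numpy as np

def smoothgrad(grad_fn, x, noise_std=0.1, n_samples=50, seed=0):
    """Average grad_fn over Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(scale=noise_std, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples

def toy_grad(x):
    # Hypothetical gradient of f(x) = sum(x + 0.05 * sin(40 * x)):
    # a constant trend plus a rapidly oscillating component.
    return 1.0 + 2.0 * np.cos(40.0 * x)

x = np.linspace(0.0, 1.0, 16)
raw = toy_grad(x)                                            # fluctuates strongly
smooth = smoothgrad(toy_grad, x, noise_std=0.2, n_samples=200)
print("raw std:   ", raw.std())
print("smooth std:", smooth.std())
```

In this toy case the oscillating component averages out and the smoothed attribution stays close to the underlying trend. Whether such smoothing makes real explanations easier to learn is what the evaluation in this work measures.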
