
The Generalizability of Explanations

Hanxiao Tan

TL;DR

This work tackles the problem of objectively evaluating post-hoc explanations without ground truth by introducing a generalizability-based framework. It uses an Autoencoder to learn the distribution of explanations generated by a given method and assesses both learnability (how well explanations can be reconstructed) and distribution proximity (how closely reconstructed explanations resemble the original explanation distribution). The approach enables quantitative comparisons across gradient-based and perturbation-based explainability methods and reveals that perturbation-based methods, as well as SmoothGrad-enhanced variants, tend to yield more generalizable explanation distributions. The findings offer a practical metric for selecting and refining explainability techniques, with potential implications for trustworthy AI deployment, particularly in high-stakes vision applications.

Abstract

Due to the absence of ground truth, objective evaluation of explainability methods is an essential research direction. So far, the vast majority of evaluations fall into three categories: human evaluation, sensitivity testing, and sanity checks. This work proposes a novel evaluation methodology from the perspective of generalizability. We employ an Autoencoder to learn the distributions of the generated explanations and observe their learnability as well as the plausibility of the learned distributional features. We first briefly demonstrate the evaluation idea of the proposed approach on LIME, and then quantitatively evaluate multiple popular explainability methods. We also find that smoothing the explanations with SmoothGrad can significantly enhance the generalizability of explanations.
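The learnability idea can be illustrated with a deliberately small stand-in. The sketch below is not the paper's Autoencoder: it swaps in a plain linear map trained by gradient descent, and the function `train_linear_reconstructor`, the toy "images", and the `mixing` matrix are all hypothetical names and data chosen for illustration. It only shows the intuition that explanations which depend systematically on the input can be reconstructed with low error, while random explanations cannot.

```python
import random


def train_linear_reconstructor(inputs, targets, steps=300, lr=0.5):
    """Fit targets ≈ W x by full-batch gradient descent; return the final MSE."""
    n, d_in, d_out = len(inputs), len(inputs[0]), len(targets[0])
    weights = [[0.0] * d_in for _ in range(d_out)]

    def predict(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    def mse():
        total = 0.0
        for x, t in zip(inputs, targets):
            total += sum((p - y) ** 2 for p, y in zip(predict(x), t))
        return total / (n * d_out)

    for _ in range(steps):
        # Accumulate the gradient of the mean squared error over the batch.
        grads = [[0.0] * d_in for _ in range(d_out)]
        for x, t in zip(inputs, targets):
            pred = predict(x)
            for k in range(d_out):
                err = pred[k] - t[k]
                for j in range(d_in):
                    grads[k][j] += 2.0 * err * x[j] / (n * d_out)
        for k in range(d_out):
            for j in range(d_in):
                weights[k][j] -= lr * grads[k][j]
    return mse()


rng = random.Random(0)

# Toy "images": 50 inputs with 3 pixels each (made-up data).
images = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]

# Structured "explanations": a fixed function of the input (assumed for the demo).
mixing = [[0.5, -0.2, 0.1], [0.0, 0.8, -0.3], [0.4, 0.1, 0.6]]
structured = [[sum(m * v for m, v in zip(row, x)) for row in mixing] for x in images]

# Random "explanations": independent of the input, used as a baseline.
random_maps = [[rng.uniform(-1, 1) for _ in range(3)] for _ in images]

structured_loss = train_linear_reconstructor(images, structured)
random_loss = train_linear_reconstructor(images, random_maps)
print(f"structured: {structured_loss:.2e}, random: {random_loss:.2e}")
```

In this toy setting the reconstruction error for the structured maps falls to nearly zero, while the error for the random maps stays close to the variance of the noise. A real experiment would replace the linear map with an Autoencoder and the toy vectors with images and attribution maps.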

Paper Structure (13 sections, 9 equations, 12 figures, 3 tables)


Figures (12)

  • Figure 1: An overview of the evaluation method. We first select the explainability method to be evaluated and generate explanations using the classification model and the original inputs. Subsequently, we train a generative model that takes the original image as input and attempts to reconstruct the generated explanations. Finally, we compare the distributional relationships between the reconstructed instances and the explanations.
  • Figure 2: The training curves of LIME with 10 (yellow), 30 (green), 50 (blue), 100 (red), and 500 (purple) perturbed samples, respectively. The y-axis is the value of the corresponding metric and the x-axis is the training epoch.
  • Figure 3: The intra-class (blue) and inter-class (orange) similarity (discrepancy) of the samples generated by the Autoencoder based on LIME explanations with different numbers of perturbed samples. The x-axis shows, from left to right, LIME with 10, 30, 50, 100, and 500 perturbed samples, and the y-axis is the Spearman coefficient (left) and the Fréchet Inception Distance (right). Note that a large Spearman coefficient indicates similar distributions, whereas a large FID indicates dissimilar ones.
  • Figure 4: The training curves of Vanilla Gradients, GB, IxG, IG, LRP, DeepLift, LIME, KernelSHAP, and random explanations, respectively. The y-axis is the value of the corresponding metric and the x-axis is the training epoch.
  • Figure 5: The intra-class (blue) and inter-class (orange) similarity (discrepancy) of the samples generated by the Autoencoder based on the explanations of various explainability approaches. DPL, GB, IG, IxG, KSHAP, and V denote DeepLift, Guided Backpropagation, Integrated Gradients, Input$\times$Gradients, KernelSHAP, and Vanilla Gradients, respectively, and the y-axis is the Spearman coefficient (left) and the Fréchet Inception Distance (right). The FIDs of the perturbation-based explanations are plotted separately because they are not of the same order of magnitude as the rest.
  • ...and 7 more figures
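
The intra-class and inter-class Spearman comparison named in the captions of Figures 3 and 5 can be sketched as follows. This is a minimal illustration under assumptions: the captions do not state how pairs are formed or aggregated, so averaging over all pairs is a choice made here, and `average_ranks`, `spearman`, `intra_inter_similarity`, and the toy maps are hypothetical names and data, not the paper's code.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant map: correlation is undefined, treated as 0 here
    return cov / (var_a * var_b) ** 0.5


def intra_inter_similarity(maps, labels):
    """Mean pairwise Spearman over same-class pairs and over different-class pairs."""
    intra, inter = [], []
    for i, j in combinations(range(len(maps)), 2):
        score = spearman(maps[i], maps[j])
        (intra if labels[i] == labels[j] else inter).append(score)
    return mean(intra), mean(inter)


# Toy flattened attribution maps (made-up numbers, not results from the paper).
toy_maps = [
    [0.9, 0.7, 0.2, 0.1],  # class 0
    [0.8, 0.9, 0.1, 0.3],  # class 0
    [0.1, 0.2, 0.8, 0.9],  # class 1
    [0.3, 0.1, 0.9, 0.7],  # class 1
]
toy_labels = [0, 0, 1, 1]
intra_score, inter_score = intra_inter_similarity(toy_maps, toy_labels)
print(intra_score, inter_score)
```

On the toy data the intra-class score exceeds the inter-class score, which is the pattern one would expect when reconstructed explanations carry class-specific structure. The function requires at least one same-class pair and one different-class pair; otherwise `mean` raises an error on the empty list.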