Explanation-based Training with Differentiable Insertion/Deletion Metric-aware Regularizers

Yuya Yoshikawa, Tomoharu Iwata

TL;DR

This work tackles explanation faithfulness by introducing ID-ExpO, a training framework that jointly optimizes predictive accuracy and explanation quality through differentiable insertion/deletion metric-based regularizers. By replacing non-differentiable masking with soft-mask approximations, it makes the faithfulness metrics $\mathrm{Ins}$ and $\mathrm{Del}$ differentiable with respect to the explanations, enabling backpropagation for both perturbation-based (e.g., LIME, KernelSHAP) and gradient-based (e.g., Grad-CAM) explainers. Empirical results on image and tabular datasets show that ID-ExpO yields more faithful explanations (higher insertion, lower deletion scores) while maintaining competitive accuracy, outperforming stability- and fidelity-oriented prior methods. The approach broadens the practical utility of post-hoc explainers and can be extended to inherently interpretable models, with publicly available code to reproduce results.
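For readers unfamiliar with the two metrics, the following is a minimal sketch of how insertion and deletion scores are commonly computed for a single sample. The toy `predictor`, the zero baseline, and the plain average over steps are illustrative assumptions, not the paper's exact definitions of $\mathrm{Ins}$ and $\mathrm{Del}$.

```python
import math


def predictor(x):
    """Hypothetical stand-in for a trained predictor: a logistic model on 3 features."""
    weights = [2.0, -1.0, 0.5]
    z = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def ranking(phi):
    """Feature indices ordered from most to least important according to phi."""
    return sorted(range(len(phi)), key=lambda i: phi[i], reverse=True)


def insertion_score(f, x, phi, baseline=0.0, steps=None):
    """Start from the baseline, restore features in importance order, average f."""
    steps = len(x) if steps is None else steps
    current = [baseline] * len(x)
    total = 0.0
    for i in ranking(phi)[:steps]:
        current[i] = x[i]
        total += f(current)
    return total / steps


def deletion_score(f, x, phi, baseline=0.0, steps=None):
    """Start from the input, remove features in importance order, average f."""
    steps = len(x) if steps is None else steps
    current = list(x)
    total = 0.0
    for i in ranking(phi)[:steps]:
        current[i] = baseline
        total += f(current)
    return total / steps


x = [1.0, 1.0, 1.0]
aligned_phi = [2.0, -1.0, 0.5]    # ranks features the way the toy predictor uses them
reversed_phi = [-2.0, 1.0, -0.5]  # ranks them in the opposite order

for name, phi in [("aligned", aligned_phi), ("reversed", reversed_phi)]:
    print(name, insertion_score(predictor, x, phi), deletion_score(predictor, x, phi))
```

In this toy setting the aligned explanation obtains a higher insertion score and a lower deletion score than the reversed one, which is the direction described above as more faithful.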

Abstract

The quality of explanations for the predictions made by complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how accurately the explanations reflect the predictor's behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both the insertion and deletion scores of the explanations while maintaining their predictive accuracy. Because the original insertion and deletion metrics are non-differentiable with respect to the explanations and therefore cannot be used directly for gradient-based optimization, we extend the metrics so that they are differentiable and use them to formalize insertion and deletion metric-based regularizers. Our experimental results on image and tabular datasets show that the deep neural network-based predictors that are fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easier-to-interpret explanations while maintaining high predictive accuracy. The code is available at https://github.com/yuyay/idexpo.
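To illustrate why a hard mask blocks gradient-based optimization, the sketch below compares a thresholded mask with a sigmoid soft mask using finite differences. The sigmoid relaxation, the `temperature` value, the fixed `thresholds`, and the toy `predictor` are assumptions chosen for illustration; the paper's actual differentiable extension may be constructed differently.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predictor(x):
    """Hypothetical stand-in for a differentiable predictor (logistic model)."""
    weights = [2.0, -1.0, 0.5]
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


def masked_input(x, mask, baseline=0.0):
    """Blend each feature between its value (mask = 1) and the baseline (mask = 0)."""
    return [m * v + (1.0 - m) * baseline for m, v in zip(mask, x)]


def hard_mask(phi, tau):
    """Step function: piecewise constant in phi, so its gradient is zero almost everywhere."""
    return [1.0 if p > tau else 0.0 for p in phi]


def soft_mask(phi, tau, temperature=0.5):
    """Assumed sigmoid relaxation of the step function: smooth in phi."""
    return [sigmoid((p - tau) / temperature) for p in phi]


def insertion_like_score(f, x, phi, mask_fn, thresholds):
    """Average prediction as the threshold varies, i.e. as features are inserted."""
    values = [f(masked_input(x, mask_fn(phi, tau))) for tau in thresholds]
    return sum(values) / len(values)


def finite_difference(score_fn, phi, index, eps=1e-4):
    """Central-difference estimate of d score / d phi[index]."""
    up, down = list(phi), list(phi)
    up[index] += eps
    down[index] -= eps
    return (score_fn(up) - score_fn(down)) / (2.0 * eps)


x = [1.0, 1.0, 1.0]
phi = [0.9, 0.1, 0.5]
thresholds = [0.0, 0.3, 0.7]


def hard_score(p):
    return insertion_like_score(predictor, x, p, hard_mask, thresholds)


def soft_score(p):
    return insertion_like_score(predictor, x, p, soft_mask, thresholds)


print("hard-mask gradient estimate:", finite_difference(hard_score, phi, 0))
print("soft-mask gradient estimate:", finite_difference(soft_score, phi, 0))
```

The hard-mask score does not change under a small perturbation of the explanation, so it provides no training signal, whereas the soft-mask score responds smoothly and could in principle be backpropagated through an explainer to the predictor's parameters.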

Paper Structure (20 sections, 12 equations, 14 figures, 1 table)

Figures (14)

  • Figure 1: Overview of the forward and backward computations during training using ID-ExpO. (Left) the entire computational flow for each training sample $(\boldsymbol{x}_n, y_n)$. The computation flows inside the (center) LIME/KernelSHAP and (right) Grad-CAM explainers. Here, the red double line in Grad-CAM indicates that it computes second-order derivatives when it updates predictor $f_{\theta}$, as it uses the gradients w.r.t. feature maps to obtain $\boldsymbol{\phi}$.
  • Figure 2: Mean insertion and mean one-minus-deletion scores against accuracy on CIFAR-10 in the case of $S = 0.5 \cdot HW$. The top row shows the results for LIME, and the bottom row shows the results for Grad-CAM. Each point indicates the result for the hyperparameters chosen on the basis of Eq. (\ref{eq:experiment:valscore}) with a different accuracy weight $\eta \in \{0.5, 1.0, \ldots, 3.0\}$ (points for different $\eta$ values may coincide). The higher the score, the better.
  • Figure 3: Mean sensitivity-$n$ scores against accuracy on CIFAR-10 (left) and STL-10 (right) in the case of $S = 0.3 \cdot HW$. The higher the score, the better.
  • Figure 4: Differences in the insertion and one-minus-deletion scores between before and after predictors were fine-tuned, with Grad-CAM using each method for 1,000 randomly selected individual test samples on CIFAR-10. $\mathtt{Ins}$ and $\mathtt{Del}$ indicate the mean insertion and deletion scores over the test set when the predictors are used after fine-tuning, whereas $\mathtt{Ins}^{(0)}$ and $\mathtt{Del}^{(0)}$ indicate the same scores before the fine-tuning. The percentage in each quadrant is the ratio of the samples located in the quadrant.
  • Figure 5: Examples of the produced explanations for STL-10. The first row shows the results obtained by Grad-CAM, while the second row shows the results obtained by LIME. Each row illustrates (a) an input image, (b)--(c) the heatmaps of the explanations by the explainers with ID-ExpO and $\ell_{\mathrm{CE}}$-only, and (d)--(e) the insertion score (top) and the deletion score (bottom) for those explanations in the case of $S=0.5\cdot HW$, which means that the scores correspond to the blue areas to the left of the red vertical lines.
  • ...and 9 more figures