
Revisiting the robustness of post-hoc interpretability methods

Jiawen Wei, Hugues Turbé, Gianmarco Mengaldo

TL;DR

This paper tackles the problem that post-hoc interpretability methods yield inconsistent explanations and that existing evaluations largely rely on coarse, average-case measures. It introduces a coarse-to-fine evaluation framework combining ridge-line visualizations with two new robustness metrics, $\overline{\operatorname{AUC}\textit{Skew}}$ and $\operatorname{AUC}\textit{Kurt}$, to quantify sample-level stability of explanations in time-series classification. Through experiments on one synthetic and 20 public datasets across CNN, BiLSTM, and Transformer models, it demonstrates that robustness is linked to average performance and that some methods (e.g., DS, SVS) exhibit greater robustness than others, with calibration able to enhance robustness further. The work provides a practical, reproducible toolkit for evaluating and selecting robust post-hoc explanations in high-stakes settings, contributing to more trustworthy AI deployment.

Abstract

Post-hoc interpretability methods play a critical role in explainable artificial intelligence (XAI), as they pinpoint portions of data that a trained deep learning model deemed important to make a decision. However, different post-hoc interpretability methods often provide different results, casting doubt on their accuracy. For this reason, several evaluation strategies have been proposed to understand the accuracy of post-hoc interpretability. Many of these evaluation strategies provide a coarse-grained assessment -- i.e., they evaluate how the performance of the model degrades on average by corrupting different data points across multiple samples. While these strategies are effective in selecting the post-hoc interpretability method that is most reliable on average, they fail to provide a sample-level, also referred to as fine-grained, assessment. In other words, they do not measure the robustness of post-hoc interpretability methods. We propose an approach and two new metrics to provide a fine-grained assessment of post-hoc interpretability methods. We show that the robustness of a post-hoc interpretability method is generally linked to its coarse-grained performance.
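The difference between a coarse-grained and a sample-level view can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `toy_model` is a hypothetical stand-in for a trained classifier, corrupted time steps are replaced with a constant `fill` value, and the relevance maps are hand-written rather than produced by an interpretability method.

```python
from statistics import mean


def corrupt_top_k(series, relevance, k_percent, fill=0.0):
    """Replace the k_percent most relevant time steps with `fill` (assumed corruption)."""
    n_corrupt = round(len(series) * k_percent / 100)
    # Rank time steps from most to least relevant.
    ranked = sorted(range(len(series)), key=lambda i: relevance[i], reverse=True)
    top = set(ranked[:n_corrupt])
    return [fill if i in top else x for i, x in enumerate(series)]


def toy_model(series):
    """Hypothetical classifier score: the mean of the first four time steps."""
    return sum(series[:4]) / 4


def score_drops(samples, relevances, k_percent):
    """Per-sample drop in model score after top-k corruption."""
    return [
        toy_model(x) - toy_model(corrupt_top_k(x, r, k_percent))
        for x, r in zip(samples, relevances)
    ]


# Four identical samples whose informative part is the first half.
signal = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
samples = [signal] * 4

# Hand-written relevance maps: correct for three samples, wrong for the last.
good = [1, 1, 1, 1, 0, 0, 0, 0]
bad = [0, 0, 0, 0, 1, 1, 1, 1]
relevances = [good, good, good, bad]

drops = score_drops(samples, relevances, k_percent=50)
coarse = mean(drops)  # coarse-grained view: a single average

print("per-sample drops:", drops)  # fine-grained view: one value per sample
print("mean drop:", coarse)
```

In this toy setting the average drop looks strong, yet one sample shows no drop at all. The mean alone would hide that failure, which is the kind of sample-level behavior a fine-grained assessment is meant to expose.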

Paper Structure (24 sections, 12 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: With the mean score drop of 0.66 at top-15% corruption, DeepSHAP is not robust for all samples in the synthetic dataset.
  • Figure 2: The post-hoc interpretability evaluation framework consists of four main processes: 1) neural network training for time series classification; 2) relevance score generation on the test set with post-hoc interpretability methods; 3) k-percentile corruption based on relevance ranking; 4) coarse-grained and fine-grained evaluation.
  • Figure 3: Left: $\tilde{\mathcal{S}}_\text{m}-\tilde{N}$ curve for coarse-grained evaluation and distribution example for fine-grained evaluation at top 55% corruption. Right: four distribution shapes commonly observed to visualize the robustness of post-hoc interpretability methods.
  • Figure 4: Left: ridge-line visualization of six post-hoc interpretability methods for a trained Transformer model on synthetic dataset. Top-right: scaled $\textit{skew}$-$k$ curve and $\textit{(E)kurt}$-$k$ curve. Bottom-right: coarse-grained and fine-grained metrics of six methods.
  • Figure 5: Statistical coarse-grained and fine-grained metrics for CNN, BiLSTM, and Transformer models across 20 public datasets.
  • ...and 1 more figure
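The captions above mention skew-k and (excess) kurtosis-k curves and metrics named after the area under them. The Python sketch below shows one plausible way to compute such quantities, and it rests on several assumptions: the paper's exact definitions, scaling, and sign conventions are not given in this summary, so the population-moment formulas, the trapezoidal rule, the helper names, and the per-sample score drops in `drops_by_k` are all illustrative rather than the authors' method.

```python
from statistics import mean


def central_moment(values, order):
    """Population central moment of the given order."""
    m = mean(values)
    return mean([(v - m) ** order for v in values])


def skewness(values):
    """Population skewness; defined as 0.0 here for a constant sample."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5 if m2 > 0 else 0.0


def excess_kurtosis(values):
    """Population kurtosis minus 3; defined as 0.0 here for a constant sample."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2 - 3.0 if m2 > 0 else 0.0


def trapezoid_auc(xs, ys):
    """Area under the piecewise-linear curve through (xs, ys)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    )


# Made-up per-sample score drops at three corruption fractions k.
drops_by_k = {
    0.25: [0.0, 0.0, 0.1, 0.9],
    0.50: [0.1, 0.8, 0.9, 1.0],
    0.75: [0.8, 0.9, 1.0, 1.0],
}

ks = sorted(drops_by_k)
skew_curve = [skewness(drops_by_k[k]) for k in ks]
kurt_curve = [excess_kurtosis(drops_by_k[k]) for k in ks]

# Summarize each shape curve by its area over the sampled k range.
auc_skew = trapezoid_auc(ks, skew_curve)
auc_kurt = trapezoid_auc(ks, kurt_curve)

print("skew-k curve:", skew_curve)
print("kurt-k curve:", kurt_curve)
print("area under skew curve:", auc_skew)
print("area under kurtosis curve:", auc_kurt)
```

The point of the sketch is only that shape statistics of the per-sample distribution, tracked across corruption levels, carry information that the mean score drop does not. Any scaling applied to the curves in Figure 4 is omitted here.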