Are We Merely Justifying Results ex Post Facto? Quantifying Explanatory Inversion in Post-Hoc Model Explanations

Zhen Tan, Song Wang, Yifan Li, Yu Kong, Jundong Li, Tianlong Chen, Huan Liu

TL;DR

The paper defines Explanatory Inversion and introduces Inversion Quantification (IQ) to assess whether post-hoc explanations rely on model outputs rather than faithful input-output relations. IQ combines Reliance on Outputs $R$ and Faithfulness $F$ into the Inversion Score $IS(R,F)=((R^p+(1-F)^p)/2)^{1/p}$, revealing that widely used methods like SHAP and LIME exhibit inversion across tabular, image, and text domains, especially under spurious correlations. To mitigate this, the authors propose Reproduce-by-Poking (RBP), which adds forward perturbation checks and refines attributions via $\tilde{a}^{(j)} = a^{(j)}/(1 + \nabla^{(j)} \lambda)$; they prove that RBP reduces dependence on outputs and improves faithfulness, with empirical reductions in inversion score $\text{IS}$ across modalities. The approach yields practical robustness to spurious features, demonstrated on synthetic data and a CIFAR-10 ResNet-18 case, suggesting significant improvements for trustworthy post-hoc explanations in real-world AI systems.
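To make the Inversion Score concrete, here is a minimal Python sketch of the formula $IS(R,F)=((R^p+(1-F)^p)/2)^{1/p}$ quoted above. The function name `inversion_score`, the default `p=2.0`, and the requirement that $R$ and $F$ lie in $[0, 1]$ are illustrative assumptions, not details confirmed by this summary.

```python
def inversion_score(r, f, p=2.0):
    """Power mean of reliance on outputs (r) and unfaithfulness (1 - f).

    Assumptions (not stated in the summary): r and f are both in [0, 1],
    and p > 0. The default p = 2.0 is only an illustrative choice.
    """
    if not (0.0 <= r <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("r and f are assumed to lie in [0, 1]")
    if p <= 0:
        raise ValueError("p is assumed to be positive")
    return ((r ** p + (1.0 - f) ** p) / 2.0) ** (1.0 / p)


# Low reliance and high faithfulness give a low score; the reverse gives a high one.
low = inversion_score(0.1, 0.9)
high = inversion_score(0.9, 0.1)
```

Under these assumptions the score is 0 for an explanation with no output reliance and perfect faithfulness, and 1 at the opposite extreme; larger `p` weights the worse of the two components more heavily.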

Abstract

Post-hoc explanation methods provide interpretation by attributing predictions to input features. Natural explanations are expected to interpret how the inputs lead to the predictions. Thus, a fundamental question arises: Do these explanations unintentionally reverse the natural relationship between inputs and outputs? Specifically, are the explanations rationalizing predictions from the output rather than reflecting the true decision process? To investigate such explanatory inversion, we propose Inversion Quantification (IQ), a framework that quantifies the degree to which explanations rely on outputs and deviate from faithful input-output relationships. Using the framework, we demonstrate on synthetic datasets that widely used methods such as LIME and SHAP are prone to such inversion, particularly in the presence of spurious correlations, across tabular, image, and text domains. Finally, we propose Reproduce-by-Poking (RBP), a simple and model-agnostic enhancement to post-hoc explanation methods that integrates forward perturbation checks. We further show that under the IQ framework, RBP theoretically guarantees the mitigation of explanatory inversion. Empirically, for example, on the synthesized data, RBP can reduce the inversion by 1.8% on average across iconic post-hoc explanation approaches and domains.
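The tabular spurious-correlation setup can be sketched as follows, using the details given in the captions of Figures 1 and 2: the target is $y = x^{(1)} \cdot \sin(x^{(2)})$, the third feature is standard normal noise during training, and at test time it becomes linearly dependent on $y$ plus noise. The distributions of $x^{(1)}$ and $x^{(2)}$, the exact form `psi * y + noise * eps`, and all function names are assumptions made for illustration.

```python
import math
import random


def make_split(n, psi=None, noise=0.1, seed=0):
    """Generate rows (x1, x2, x3, y) with y = x1 * sin(x2).

    psi=None  -> x3 is independent standard normal noise (training split).
    psi=float -> x3 = psi * y + noise * eps (hypothetical test-time injection).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)  # assumed input distribution
        y = x1 * math.sin(x2)
        if psi is None:
            x3 = rng.gauss(0, 1)
        else:
            x3 = psi * y + noise * rng.gauss(0, 1)
        rows.append((x1, x2, x3, y))
    return rows


def pearson(u, v):
    """Plain Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


train = make_split(2000, psi=None, seed=0)
test = make_split(2000, psi=1.0, seed=1)
train_corr = pearson([r[2] for r in train], [r[3] for r in train])
test_corr = pearson([r[2] for r in test], [r[3] for r in test])
```

In this sketch `train_corr` stays near zero while `test_corr` is close to one, so a faithful explanation of a model trained on the first split should still assign little importance to the third feature on the second split.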

Paper Structure

This paper contains 34 sections, 4 theorems, 45 equations, 14 figures, 2 tables.

Key Result

Theorem 4.5

The proposed Inversion Score (IS) effectively quantifies explanatory inversion as formally defined in the paper. The proof is presented in the paper's appendix.

Figures (14)

  • Figure 1: Illustration of post-hoc explanation methods and the potential of explanatory inversion. For tabular data (first row), ground-truth explanations attribute the output $y = x^{(1)} \cdot \sin(x^{(2)})$ to the contributions of features $x^{(1)}$ and $x^{(2)}$, but explanatory inversion misattributes $x^{(2)}$ and $x^{(3)}$. For image data (second row), explanations should focus on the correct object, but explanatory inversion leads to incorrect focus regions. For text data (third row), ground-truth explanations link keywords to labels, but explanatory inversion results in misaligned or irrelevant attributions.
  • Figure 2: Illustration of IQ with spurious feature injection and its impact on explanations across modalities. For tabular data (first row), during training, feature $x^{(3)}$ follows a standard normal distribution, making it independent of the target variable. At test time, a spurious correlation is introduced where $x^{(3)}$ is linearly dependent on $y$ with noise $\varepsilon$, leading to incorrect reliance. For image data (second row), a distractor is injected into the test set, shifting explanations toward irrelevant regions. For text data (third row), an additional token (e.g., peach) appears in test samples, causing explanations to assign importance to non-informative words.
  • Figure 3: Visualization of feature attributions for the shape classification task under both normal and spurious scenarios. Each row displays an input image and feature attributions generated by four post-hoc explanation methods: Integrated Gradients (IG), Occlusion, Shapley Value Sampling, and LIME. Columns show comparisons between normal (left) and spurious (right) conditions. The spurious scenario introduces a bright distractor pixel in the top-left corner of images labeled as 1, which leads to incorrect attributions in several methods. Desired focus regions (e.g., the object shape) are highlighted under normal conditions, while spurious conditions shift the attributions toward the irrelevant injected pixel. More case studies are included in the appendix.
  • Figure 4: Overview of RBP, divided into three stages. In the Attribution Perturbation stage (left), multiple perturbed samples are generated from a given input sample $\mathbf{x}$, altering feature values. In the Attribution Deviation stage (middle), deviations $\delta^{(j)}$ are computed for each feature attribution $a^{(j)}$ based on differences across perturbations. Finally, in the Attribution Refinement stage (right), attributions are refined by reducing the influence of features with high deviation, yielding adjusted attributions $a^{\prime (j)}$.
  • Figure 5: Ablation study analyzing (i) the properties of explanatory inversion via different spurious injection settings of IQ; and (ii) the robustness and effectiveness of RBP under various hyper-parameter conditions. (a) Impact of varying the number of spurious features on $\Delta \mathrm{IS}$. (b) Influence of spurious feature strength $\psi$ on $\Delta \mathrm{IS}$. (c) Effect of the number of perturbation checks in RBP on $\mathrm{IS}$. (d) Influence of perturbation magnitude in RBP on $\mathrm{IS}$.
  • ...and 9 more figures
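
The three RBP stages described in the Figure 4 caption (perturb, measure deviation, refine) can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the refinement rule $a^{\prime (j)} = a^{(j)}/(1 + \lambda \delta^{(j)})$ follows the form quoted in the TL;DR, but the deviation measure used here (the gap between the observed output change and the change predicted by reading $a^{(j)}$ as a local sensitivity) is an assumption, as are the function names, the toy model, and the hyper-parameter defaults.

```python
import math
import random


def model(x):
    """Toy black box: y = x1 * sin(x2); the third feature is ignored."""
    return x[0] * math.sin(x[1])


def rbp_refine(f, x, attributions, lam=5.0, n_checks=20, magnitude=0.1, seed=0):
    """Down-weight attributions that fail forward perturbation checks.

    Assumed deviation: delta_j = sum|observed - a_j * eps| / sum|eps|,
    i.e. each attribution is treated as a local sensitivity and compared
    with what actually happens when feature j is poked by eps.
    """
    rng = random.Random(seed)
    base = f(x)
    refined = []
    for j, a_j in enumerate(attributions):
        gap_sum, eps_sum = 0.0, 0.0
        for _ in range(n_checks):
            # Stage 1: perturb a single feature of the input.
            eps = rng.uniform(-magnitude, magnitude)
            poked = list(x)
            poked[j] += eps
            # Stage 2: compare observed and attribution-predicted change.
            observed = f(poked) - base
            gap_sum += abs(observed - a_j * eps)
            eps_sum += abs(eps)
        delta_j = gap_sum / eps_sum if eps_sum > 0 else 0.0
        # Stage 3: shrink attributions with high deviation.
        refined.append(a_j / (1.0 + lam * delta_j))
    return refined


x = [1.0, 0.5, 2.0]
# Hypothetical attributions: correct sensitivities for x1 and x2,
# plus an inflated value for the irrelevant third feature.
attributions = [math.sin(0.5), math.cos(0.5), 0.8]
refined = rbp_refine(model, x, attributions)
```

In this toy setting the attribution on the ignored third feature is cut sharply because poking that feature never changes the output, while the attributions on the two features the model actually uses are left almost unchanged.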

Theorems & Definitions (13)

  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Definition 4.4
  • Theorem 4.5
  • Definition 4.6
  • Theorem 5.1: Reduction of Output Reliance
  • Theorem 5.2: Faithfulness Improvement
  • Theorem 5.3: Resilience to Spurious Features
  • proof
  • ...and 3 more