Provable Robust Saliency-based Explanations

Chao Chen; Chenghua Guo; Rufeng Chen; Guixiang Ma; Ming Zeng; Xiangwen Liao; Xi Zhang; Sihong Xie

TL;DR

This work addresses the trustworthiness of explanations by moving beyond $\ell_p$-based stability to a ranking-centric notion called explanation thickness for the top-$k$ salient features. It introduces R2ET, a training objective that regularizes thickness (and Hessian-related terms) to promote robust rankings, and establishes theoretical links to certified robustness and connections to adversarial training and constrained optimization. Through a multi-objective robustness analysis and extensive experiments across tabular, image, and graph datasets, R2ET demonstrates superior stability of saliency rankings under stealthy attacks while preserving predictive accuracy and generalization across explanation methods. The results suggest thickness as a high-signal criterion for selecting robust explanations and provide practical, theoretically grounded defenses against manipulation of explanations in real-world settings.

Abstract

To foster trust in machine learning models, explanations must be faithful and stable so that they yield consistent insights. Existing work relies on the $\ell_p$ distance to assess stability, which diverges from human perception. Moreover, existing adversarial training (AT) is computationally intensive and may lead to an arms race. To address these challenges, we introduce a novel metric to assess the stability of top-$k$ salient features. We introduce R2ET, which trains for stable explanations with an efficient and effective regularizer, and analyze R2ET through multi-objective optimization to prove the numerical and statistical stability of explanations. Moreover, theoretical connections between R2ET and certified robustness justify R2ET's stability under all attacks. Extensive experiments across various data modalities and model architectures show that R2ET achieves superior stability against stealthy attacks and generalizes effectively across different explanation methods.
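The gap between $\ell_p$ distance and top-$k$ agreement can be made concrete with a minimal Python sketch. The saliency vectors below are made-up numbers, and `top_k_overlap` is a hypothetical helper (the fraction of the original top-$k$ features that remain in the perturbed top-$k$). It is an illustration of the motivation only, not the thickness metric the paper proposes.

```python
import math

def top_k(scores, k):
    """Return the set of indices of the k highest saliency scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def top_k_overlap(original, perturbed, k):
    """Fraction of the original top-k features still in the perturbed top-k."""
    return len(top_k(original, k) & top_k(perturbed, k)) / k

def l2_distance(u, v):
    """Euclidean distance between two saliency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Illustrative (made-up) saliency scores for five features.
original = [0.90, 0.80, 0.70, 0.10, 0.05]
far_same_rank = [1.50, 1.40, 1.30, 0.10, 0.05]  # large shift, ranking unchanged
near_swapped = [0.90, 0.80, 0.35, 0.45, 0.05]   # small shift, features 2 and 3 swap

for name, pert in [("far_same_rank", far_same_rank), ("near_swapped", near_swapped)]:
    dist = l2_distance(original, pert)
    overlap = top_k_overlap(original, pert, 3)
    print(f"{name}: l2 = {dist:.3f}, top-3 overlap = {overlap:.2f}")
```

In this toy case, the second perturbation is closer to the original in $\ell_2$ distance yet keeps only two of the three top features, while the first is farther away and keeps all three.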

Paper Structure (38 sections, 18 theorems, 66 equations, 9 figures, 9 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Explanation Robustness via Thickness
Ranking explanation thickness
R2ET: Training for robust ranking explanations
Analyses of numerical and statistical robustness
Experiments
Compared methods
Overall robustness results
Understanding thickness and attackability
Case study: saliency maps visualization
Conclusion
Acknowledgments
Proofs
...and 23 more sections

Key Result

Proposition 4.4

(Bounds of thickness) Given an $L$-locally Lipschitz model $f$, for some $L>0$, the pairwise ranking thickness $\Theta(f, \mathbf{x}, \mathcal{D}, i, j)$ is bounded by […], where $H_i(\mathbf{x})$ is the derivative of $\mathcal{I}_i(\mathbf{x})$ with respect to $\mathbf{x}$, and $L_i=\max_{\mathbf{x}^\prime \in \mathcal{B}_2(\mathbf{x},\epsilon)} \| H_i(\mathbf{x}^\prime) \|_2$.
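As a rough illustration of the quantities in this proposition, the sketch below estimates a pairwise saliency gap averaged over an $\ell_2$ ball for a toy quadratic model. Several things here are assumptions and not taken from the text above: saliency is taken to be the input gradient, thickness is approximated as the average gap $\mathcal{I}_i - \mathcal{I}_j$ along line segments from $\mathbf{x}$ to points sampled in $\mathcal{B}_2(\mathbf{x},\epsilon)$, and all function names are hypothetical. For this toy model $H_i$ is constant (row $i$ of the matrix `A`), so $L_i$ is simply that row's norm.

```python
import math
import random

def saliency(A, b, x):
    """Gradient of the toy model f(x) = 0.5 * x^T A x + b^T x (A symmetric)."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]

def gap(A, b, x, i, j):
    """Saliency gap between features i and j at input x."""
    s = saliency(A, b, x)
    return s[i] - s[j]

def row_norm(A, i):
    """l2 norm of row i of A; equals L_i here because H_i is constant."""
    return math.sqrt(sum(a * a for a in A[i]))

def sample_in_ball(x, eps, rng):
    """Sample a point uniformly from the l2 ball of radius eps around x."""
    d = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(v * v for v in d)) or 1.0
    r = eps * rng.random() ** (1.0 / len(x))
    return [xi + r * v / norm for xi, v in zip(x, d)]

def pairwise_thickness(A, b, x, i, j, eps, n_samples=200, n_steps=10, seed=0):
    """Monte Carlo estimate: mean gap along segments from x to sampled neighbors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_prime = sample_in_ball(x, eps, rng)
        for s in range(n_steps):
            t = (s + 0.5) / n_steps  # midpoint rule along the segment
            x_t = [(1 - t) * a + t * c for a, c in zip(x, x_prime)]
            total += gap(A, b, x_t, i, j)
    return total / (n_samples * n_steps)

# Made-up toy model and input.
A = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 0.5]]
b = [0.1, 0.0, -0.2]
x = [1.0, 1.0, 1.0]
eps = 0.1

theta = pairwise_thickness(A, b, x, 0, 1, eps)
slack = 0.5 * eps * (row_norm(A, 0) + row_norm(A, 1))
print(f"gap at x = {gap(A, b, x, 0, 1):.3f}, estimate = {theta:.3f}, slack = {slack:.3f}")
```

For this linear-saliency toy model, the Cauchy-Schwarz inequality guarantees that the estimate stays within $\tfrac{1}{2}\epsilon(L_i + L_j)$ of the gap at $\mathbf{x}$. Whether that matches the exact form of the bound in Proposition 4.4 is not stated in the text here, so treat the slack term as a property of this sketch only.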

Figures (9)

Figure 1: Left: Green (①): Model training. Yellow (②-④): Explanation generation for a target input. Red (⑤-⑥): Adversarial attacks against the explanation by manipulating the input. Right: Two examples of saliency maps (explanations) show that smaller $\ell_p$ distances do not imply similar top salient features. $\mathcal{I}(\mathbf{x}^{\prime\prime})$ has a smaller $\ell_2$ distance from the original explanation $\mathcal{I}(\mathbf{x})$, but manipulates the explanation more significantly (by the top-$k$ metric), as shown in the blue dashed boxes. Statistically, $\mathcal{I}(\mathbf{x}^{\prime\prime})$ has a 67% top-3 overlap in the tabular case and a 36% top-50 overlap in the image case, compared with $\mathcal{I}(\mathbf{x}^\prime)$'s 100% and 92% top-$k$ overlap, respectively.
Figure 2: The number of iterations to first flip versus sample-level thickness (left) and Hessian norm (right) for R2ET on COMPAS. Each dot represents an individual sample $\mathbf{x}$.
Figure 3: Explanations of original (ori.) and perturbed (pert.) images against ERAttack from MNIST (class digit 3, $k$=50) and CIFAR-10 (class ship, $k$=100). The top $k$ salient pixels are highlighted, and darker colors indicate higher importance. P@$k$ is reported within each subplot.
Figure 4: Correlation between the manipulation epoch and other metrics, including thickness evaluated by adversarial samples, Hessian norm, and thickness evaluated by (mean and min of) Gaussian samples, for different models on various datasets. From top to bottom: Vanilla and est-H models on Adult, followed by R2ET models on Adult, COMPAS, MNIST, ADHD, and BP.
Figure 5: Saliency maps concerning the original image pair and the image pair perturbed under ERAttack for all methods. The red pixels are the top 50 important features in saliency maps, with darker colors meaning more important.
...and 4 more figures

Theorems & Definitions (25)

Definition 4.1: Pairwise ranking thickness
Definition 4.2: Top-$k$ ranking thickness
Definition 4.3: Locally Lipschitz continuity
Proposition 4.4
Proposition 4.5
Proposition 4.6
Proposition 4.7
Theorem 5.1
Theorem 5.2
Proposition A.1
...and 15 more
