Provable Robust Saliency-based Explanations
Chao Chen, Chenghua Guo, Rufeng Chen, Guixiang Ma, Ming Zeng, Xiangwen Liao, Xi Zhang, Sihong Xie
TL;DR
This work addresses the trustworthiness of explanations by moving beyond $\ell_p$-based stability to a ranking-centric notion called explanation thickness for the top-$k$ salient features. It introduces R2ET, a training objective that regularizes thickness (and Hessian-related terms) to promote robust rankings, and establishes theoretical links to certified robustness and connections to adversarial training and constrained optimization. Through a multi-objective robustness analysis and extensive experiments across tabular, image, and graph datasets, R2ET demonstrates superior stability of saliency rankings under stealthy attacks while preserving predictive accuracy and generalization across explanation methods. The results suggest thickness as a high-signal criterion for selecting robust explanations and provide practical, theoretically grounded defenses against manipulation of explanations in real-world settings.
Abstract
To foster trust in machine learning models, explanations must be faithful and stable so that they yield consistent insights. Existing works assess stability with the $\ell_p$ distance, which diverges from human perception. Moreover, existing adversarial training (AT) methods are computationally intensive and may lead to an arms race. To address these challenges, we introduce a novel metric to assess the stability of top-$k$ salient features. We introduce R2ET, which trains for stable explanations with an efficient and effective regularizer, and we analyze R2ET through multi-objective optimization to prove the numerical and statistical stability of explanations. Moreover, theoretical connections between R2ET and certified robustness justify R2ET's stability under all attacks. Extensive experiments across various data modalities and model architectures show that R2ET achieves superior stability against stealthy attacks and generalizes effectively across different explanation methods.
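
The gap between $\ell_p$ distance and top-$k$ ranking stability can be illustrated with a minimal sketch. This is not the paper's thickness metric or the R2ET objective; the helper names (`top_k`, `top_k_overlap`, `l2_distance`) and the toy saliency vectors are hypothetical and chosen only to show that a small $\ell_2$ change can alter the top-$k$ features while a larger one can leave them intact.

```python
import math

def top_k(saliency, k):
    """Indices of the k features with the largest absolute saliency."""
    order = sorted(range(len(saliency)), key=lambda i: abs(saliency[i]), reverse=True)
    return set(order[:k])

def top_k_overlap(a, b, k):
    """Fraction of top-k features shared by two saliency vectors."""
    return len(top_k(a, k) & top_k(b, k)) / k

def l2_distance(a, b):
    """Euclidean distance between two saliency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy saliency vectors (illustrative values, not from the paper).
original = [0.50, 0.49, 0.48, 0.10]
swapped = [0.50, 0.47, 0.49, 0.10]  # tiny change that swaps features 1 and 2 in the ranking
shifted = [0.80, 0.79, 0.48, 0.10]  # larger change that keeps the same top-2 features

for name, other in [("swapped", swapped), ("shifted", shifted)]:
    print(name,
          "l2 =", round(l2_distance(original, other), 3),
          "top-2 overlap =", top_k_overlap(original, other, 2))
```

In this toy case the vector that is closer in $\ell_2$ distance loses half of its top-2 features, while the more distant one keeps both, which is the kind of mismatch that motivates a ranking-based stability measure.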
