Reliable Explanations or Random Noise? A Reliability Metric for XAI
Poushali Sengupta, Sabita Maharjan, Frank Eliassen, Shashi Raj Pandey, Yan Zhang
TL;DR
The paper asks whether explanations themselves can be trusted and introduces the Explanation Reliability Index (ERI), an axiomatic framework that quantifies how stable attribution signals remain under non-adversarial input perturbations, feature redundancy, model evolution, and distributional shifts. ERI decomposes stability into five components (ERI-S, ERI-R, ERI-M, ERI-D, and the temporal measure ERI-T) and aggregates them into a scalar reliability score, supported by Lipschitz-type guarantees and temporal bounds. The paper also introduces ERI-Bench, a standardized benchmark for stress-testing explanation reliability across diverse data modalities, and provides theoretical results showing that popular explainers often fail the reliability axioms while dependence-aware methods such as MCIR exhibit stronger reliability. Across EEG, HAR, energy-load forecasting, and CIFAR-10, empirical results reveal substantial reliability gaps in common explainers and highlight the practical value of ERI for reliability-aware XAI. The work positions ERI as a complementary layer to faithfulness and causality, supporting safer, more trustworthy deployment of XAI in high-stakes applications.
Abstract
In recent years, explaining decisions made by complex machine learning models has become essential in high-stakes domains such as energy systems, healthcare, finance, and autonomous systems. However, the reliability of these explanations, namely, whether they remain stable and consistent under realistic, non-adversarial changes, remains largely unmeasured. Widely used methods such as SHAP and Integrated Gradients (IG) are well-motivated by axiomatic notions of attribution, yet their explanations can vary substantially even under ordinary system-level conditions, such as small input perturbations, correlated representations, and minor model updates. Such variability undermines explanation reliability, as reliable explanations should remain consistent across equivalent input representations and small, performance-preserving model changes. We introduce the Explanation Reliability Index (ERI), a family of metrics that quantifies explanation stability under four reliability axioms: robustness to small input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to mild distributional shifts. For each axiom, we derive formal guarantees, including Lipschitz-type bounds and temporal stability results. We further propose ERI-T, a dedicated measure of temporal reliability for sequential models, and introduce ERI-Bench, a benchmark designed to systematically stress-test explanation reliability across synthetic and real-world datasets. Experimental results reveal widespread reliability failures in popular explanation methods, showing that explanations can be unstable under realistic deployment conditions. By exposing and quantifying these instabilities, ERI enables principled assessment of explanation reliability and supports more trustworthy explainable AI (XAI) systems.
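To make the first axiom (robustness to small input perturbations) concrete, the sketch below estimates how far an attribution vector drifts when Gaussian noise is added to the input. It is an illustration only and is not the paper's definition of ERI or of any of its components: the function names `perturbation_stability` and `gradient_x_input`, the noise scale `sigma`, and the mapping of mean relative drift to a score in (0, 1] are all assumptions made for this example.

```python
import numpy as np


def gradient_x_input(w, x):
    """Toy attribution for a linear model f(x) = w @ x: gradient times input."""
    return w * x


def perturbation_stability(explain, x, sigma=0.01, n_samples=200, seed=0):
    """Hypothetical stability score in (0, 1]; 1.0 means no attribution drift.

    Assumption: stability is 1 / (1 + mean relative L2 drift) of the
    attribution under additive Gaussian input noise. This is NOT the
    paper's ERI formula, only one plausible way to score the idea.
    """
    rng = np.random.default_rng(seed)
    base = explain(x)
    base_norm = np.linalg.norm(base) + 1e-12  # guard against division by zero
    drifts = []
    for _ in range(n_samples):
        x_pert = x + rng.normal(0.0, sigma, size=x.shape)
        drifts.append(np.linalg.norm(explain(x_pert) - base) / base_norm)
    return 1.0 / (1.0 + float(np.mean(drifts)))


if __name__ == "__main__":
    w = np.array([1.0, -2.0, 0.5])
    x = np.array([0.3, 1.2, -0.7])
    score = perturbation_stability(lambda z: gradient_x_input(w, z), x)
    print(f"stability under sigma=0.01: {score:.4f}")
```

Under these assumptions, an explainer whose output ignores the noise scores exactly 1.0, and the score falls as `sigma` grows. Analogous scores for redundancy, model updates, or distribution shift would replace the noise step with the corresponding transformation.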
