On the Robustness of Explanations of Deep Neural Network Models: A Survey
Amlan Jyoti, Karthik Balaji Ganesh, Manoj Gayala, Nandita Lakshmi Tunuguntla, Sandesh Kamath, Vineeth N Balasubramanian
TL;DR
This survey addresses the largely overlooked problem of robustness in explanations for deep neural networks. It synthesizes methods that study, attack, and defend attribution maps, reviews notation, axioms, and fidelity/robustness metrics, and catalogs attacks across image, text, and tabular data. The authors connect attributional robustness to adversarial robustness and present principled defenses, ranging from certified robustness and Lipschitz-based regularization to data augmentation and saliency-map aggregation, while outlining practical implications and future research directions. The work underscores the need for consistent, reliable explanations in safety-critical settings and provides a structured roadmap for improving robust interpretability in practice.
Abstract
Explainability has been widely stated as a cornerstone of the responsible and trustworthy use of machine learning models. With the use of Deep Neural Network (DNN) models expanding to risk-sensitive and safety-critical domains, many methods have been proposed to explain the decisions of these models. Recent years have also seen concerted efforts showing how such explanations can be distorted (attacked) by minor input perturbations. While many surveys review explainability methods themselves, there has so far been no effort to bring together the different methods and metrics proposed to study the robustness of explanations of DNN models. In this work, we present a comprehensive survey of methods that study, understand, attack, and defend explanations of DNN models. We also review in detail the metrics used to evaluate explanation methods, and we describe attributional attack and defense methods. We conclude with lessons and take-aways for the community towards ensuring robust explanations of DNN model predictions.
