On the Robustness of Explanations of Deep Neural Network Models: A Survey

Amlan Jyoti; Karthik Balaji Ganesh; Manoj Gayala; Nandita Lakshmi Tunuguntla; Sandesh Kamath; Vineeth N Balasubramanian

TL;DR

This survey addresses the largely overlooked problem of robustness in explanations for deep neural networks. It synthesizes methods that study, attack, and defend attribution maps; reviews notations, axioms, and fidelity and robustness metrics; and catalogs attacks across image, text, and tabular data. The authors connect attributional robustness to adversarial robustness and present defenses ranging from certified robustness and Lipschitz-based regularization to data augmentation and saliency-map aggregation, while outlining practical implications and future research directions. The work underscores the need for consistent, reliable explanations in safety-critical settings and provides a structured roadmap for improving robust interpretability in practice.

Abstract

Explainability has been widely stated as a cornerstone of the responsible and trustworthy use of machine learning models. With the ubiquitous use of Deep Neural Network (DNN) models expanding to risk-sensitive and safety-critical domains, many methods have been proposed to explain the decisions of these models. Recent years have also seen concerted efforts that have shown how such explanations can be distorted (attacked) by minor input perturbations. While there have been many surveys that review explainability methods themselves, there has been no effort hitherto to assimilate the different methods and metrics proposed to study the robustness of explanations of DNN models. In this work, we present a comprehensive survey of methods that study, understand, attack, and defend explanations of DNN models. We also present a detailed review of different metrics used to evaluate explanation methods, as well as describe attributional attack and defense methods. We conclude with lessons and take-aways for the community towards ensuring robust explanations of DNN model predictions.

Paper Structure (51 sections, 11 figures, 7 tables)

Introduction
Overview of Explainability Methods
Evaluation of Explainability Methods
Notations
Properties and Axioms
Evaluating the Quality of Explanation methods
Evaluating the Robustness of Explanation methods
Attacking Explainability Methods
Common Techniques and Metrics
Top-$k$ fooling
COM shift
Targeted attack
Attributional Attack Methods
Model-Specific Attacks
Image Data
...and 36 more sections

Figures (11)

Figure 1: (Top row, a-d from left to right) (a) Original x-ray of a lung cancer patient; (b) Attribution map of the original x-ray highlighting relevant areas of the x-ray; (c) Watermarks added to the x-ray in (a); (d) Modified attribution map of the x-ray with watermarks, with the prediction remaining the same. The model accurately predicts the presence of lung cancer in (a) and (c), but the corresponding attribution maps in (b) and (d) respectively are not robust. Note how the areas with the watermark are highlighted in (d), indicating that the model uses the watermarks to make the decision, which makes the model unreliable. (Bottom row, a-d from left to right) (a) Original x-ray of a COVID-19 infected patient; (b) Attribution map of the original x-ray highlighting relevant areas of the x-ray; (c) Human-imperceptible perturbation added to the x-ray in (a); (d) Modified attribution map of the perturbed x-ray in (c). The model accurately predicts the presence of COVID-19 infection in (a) and (c), but the corresponding attribution maps in (b) and (d) respectively are not robust. If a physician relied on such modified attribution maps for understanding the disease, it could be life-threatening for the patient.
Figure 2: (a) AkhtarM18: (left) clean image, (right) adversarially perturbed image. (b) ghorbani2017fragile: (top) clean image, (bottom) attributionally perturbed image. (c) zhang2018fire: the explanation maps are the same, yet the predictions differ.
Figure 3: Example of attributional attack on text data (reproduced from ivankay2020far)
Figure 4: hubert2021PD construct a targeted attack on tabular data by data poisoning (the analogue of input perturbation for images). The attack is captured by the deviation (red line) from the original dependency graph (blue line), which can be pushed towards a target graph.
Figure 5: (top) dombrowski2019geometry provide the intuition for their approach by observing the normals at the decision boundary. (top left) The gradient (red arrows) changes drastically when moving along a line with high curvature but changes gradually when the curvature is low. (top right) They also propose techniques such as weight decay, which flattens the angles between piece-wise linear functions; softplus, which smooths out the kinks of the ReLU function; and Hessian minimization, which reduces the curvature locally at the data point. (bottom) Moosavi-Dezfooli19 show the benefits of adversarial training by illustrating the negative loss function along normal and random directions r and v. They observe that the original network (bottom: a, c) has large curvature in those directions. Adversarial training results in a lower curvature of the loss (bottom: b, d). The original sample is shown as a blue dot, its classification region as a light blue surface, and the adversarial region in red.
...and 6 more figures
