Exploring the Interplay of Interpretability and Robustness in Deep Neural Networks: A Saliency-guided Approach

Amira Guesmi; Nishant Suresh Aswani; Muhammad Shafique

TL;DR

Problem: balancing interpretability and robustness in deep neural networks under adversarial threats. Approach: evaluate Saliency-guided Training (SGT) and introduce Adversarial Saliency-guided Training (ASGT), which preserves salient features during adversarial training while minimizing the divergence between the model's outputs on clean and masked inputs. Contributions: evidence that SGT improves robustness (contrary to some prior claims); a novel ASGT framework that combines SGT with adversarial training (AT); substantial robustness gains on MNIST and CIFAR-10 together with improved saliency maps; and open-source code. Significance: establishes a practical path toward robust, interpretable DNNs for safety-critical applications.
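
To make the ASGT idea above concrete, here is a minimal NumPy sketch of one loss evaluation for a toy linear softmax classifier. It is an illustration under stated assumptions, not the authors' implementation: the function names, the FGSM step used to craft the adversarial input, the random-value masking rule, the $[0, 1]$ input range, and the weighting factor `lam` are all hypothetical choices, since this summary does not spell out the exact objective.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def input_gradient(W, b, x, y_onehot):
    # Gradient of the cross-entropy loss w.r.t. the input,
    # for a linear model with logits = x @ W + b.
    p = softmax(x @ W + b)
    return (p - y_onehot) @ W.T

def mask_low_saliency(x, saliency, k, rng):
    # Replace the fraction k of least-salient features in each sample
    # with random values drawn from the observed input range (assumed rule).
    n_mask = int(k * x.shape[1])
    idx = np.argsort(saliency, axis=1)[:, :n_mask]
    rows = np.arange(x.shape[0])[:, None]
    x_masked = x.copy()
    x_masked[rows, idx] = rng.uniform(x.min(), x.max(), size=idx.shape)
    return x_masked

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) per sample, with a small epsilon for numerical safety.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def asgt_loss(W, b, x, y_onehot, k=0.3, epsilon=0.2, lam=1.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # 1) Craft an adversarial input with a single FGSM step (assumption).
    grad = input_gradient(W, b, x, y_onehot)
    x_adv = np.clip(x + epsilon * np.sign(grad), 0.0, 1.0)
    # 2) Saliency = magnitude of the input gradient at the adversarial input.
    saliency = np.abs(input_gradient(W, b, x_adv, y_onehot))
    # 3) Mask the least-salient features, keeping the salient ones intact.
    x_adv_masked = mask_low_saliency(x_adv, saliency, k, rng)
    # 4) Classification loss on the adversarial input plus a divergence term
    #    between outputs on the clean input and the masked input.
    p_clean = softmax(x @ W + b)
    p_adv = softmax(x_adv @ W + b)
    p_masked = softmax(x_adv_masked @ W + b)
    ce = -np.sum(y_onehot * np.log(p_adv + 1e-12), axis=-1)
    kl = kl_divergence(p_clean, p_masked)
    return float(np.mean(ce + lam * kl))
```

In a real training loop this scalar would be minimized with respect to the model parameters by an autodiff framework; the sketch only shows how the masking and divergence terms fit together.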

Abstract

Adversarial attacks pose a significant challenge to deploying deep learning models in safety-critical applications. Maintaining model robustness while ensuring interpretability is vital for fostering trust in, and understanding of, these models. This study investigates how Saliency-guided Training (SGT), a technique that sharpens saliency maps to clarify the model's decision-making process, affects model robustness. Experiments were conducted on standard benchmark datasets using various deep learning architectures trained with and without SGT. Findings demonstrate that SGT enhances both model robustness and interpretability. Additionally, we propose a novel approach that combines SGT with standard adversarial training to achieve even greater robustness while preserving saliency map quality. Our strategy is grounded in the assumption that preserving the salient features crucial for correctly classifying adversarial examples enhances model robustness, while masking non-relevant features improves interpretability. Our technique yields significant gains, improving robustness against the PGD attack by 35\% on MNIST (noise magnitude $0.2$) and by 20\% on CIFAR-10 (noise magnitude $0.02$), while producing high-quality saliency maps.
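
For readers unfamiliar with the PGD attack used in the evaluation, the following NumPy sketch shows an $L_\infty$ PGD loop against a toy linear softmax classifier. It is a generic illustration, not the paper's evaluation code: the model, the `step_size` and `n_steps` parameters, the random start, and the $[0, 1]$ input range are assumptions, as the abstract reports only the noise magnitude $\epsilon$.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pgd_attack(W, b, x, y_onehot, epsilon, step_size, n_steps, rng=None):
    """L-infinity PGD on a linear model with logits = x @ W + b."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Random start inside the epsilon-ball, kept in the valid input range.
    x_adv = x + rng.uniform(-epsilon, epsilon, size=x.shape)
    x_adv = np.clip(x_adv, 0.0, 1.0)
    for _ in range(n_steps):
        # Gradient of the cross-entropy loss w.r.t. the current input.
        p = softmax(x_adv @ W + b)
        grad = (p - y_onehot) @ W.T
        # Ascent step along the sign of the gradient.
        x_adv = x_adv + step_size * np.sign(grad)
        # Project back onto the epsilon-ball around x, then onto [0, 1].
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

Under these assumptions, a call such as `pgd_attack(W, b, x, y_onehot, epsilon=0.2, step_size=0.05, n_steps=10)` would mirror the MNIST noise magnitude quoted above; the step size and iteration count here are placeholders.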

Paper Structure (11 sections, 9 equations, 6 figures, 1 table, 2 algorithms)

Introduction
Related Work
Adversarial Training (AT)
Saliency-guided Training (SGT) for enhancing DNN interpretability
Saliency-guided Adversarial Training (SGA) for Learning Generalizable Features
Adversarial Saliency Guided Training (ASGT)
Experiments and Results
Experimental Setup
Does Saliency-Based Training Enhance Robustness for Deep Neural Networks?
Impact of Adversarial Saliency-guided Training on both Model Robustness and Interpretability
Conclusion

Figures (6)

Figure 1: Overview of our proposed adversarial saliency-guided training (ASGT).
Figure 2: Robustness of models against adversarial examples on the MNIST dataset. Models trained using standard training (ST) and SGT with varying degrees of feature masking ($k = 0.3$ and $k = 0.5$) across various magnitudes of noise ($\epsilon$) for the FGSM, PGD, and MIFGSM attacks.
Figure 3: Robustness of models against adversarial examples on the CIFAR-10 dataset. Models trained using standard training (ST) and SGT with varying degrees of feature masking ($k = 0.1$ and $k = 0.2$) across various magnitudes of noise ($\epsilon$) for the FGSM, PGD, and MIFGSM attacks.
Figure 4: Robustness of models against adversarial examples on the MNIST dataset. Models trained using ST, SGT ($k = 0.3$), ASGT ($k = 0.3$), AT-FGSM, and SGA across various magnitudes of noise ($\epsilon$) for the FGSM, PGD, and MIFGSM attacks.
Figure 5: Robustness of models against adversarial examples on the CIFAR-10 dataset. Models trained using ST, SGT ($k = 0.1$), ASGT ($k = 0.1$), AT-FGSM, and SGA across various magnitudes of noise ($\epsilon$) for the FGSM, PGD, and MIFGSM attacks.
...and 1 more figure
