Are Classification Robustness and Explanation Robustness Really Strongly Correlated? An Analysis Through Input Loss Landscape

Tiejin Chen; Wenwang Huang; Linsey Pang; Dongsheng Luo; Hua Wei

TL;DR

Enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss, whereas a flatter input loss landscape with respect to classification loss does indicate better classification robustness.

Abstract

This paper studies deep learning robustness and challenges the conventional belief that classification robustness and explanation robustness in image classification systems are inherently correlated. Using a novel evaluation approach that leverages clustering to assess explanation robustness efficiently, we demonstrate that enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss. This contrasts with classification, where a flatter loss landscape indicates better classification robustness. To investigate this contradiction, we propose a new training method designed to adjust the loss landscape with respect to explanation loss. With this training method, we find that although such adjustments can affect the robustness of explanations, they do not influence the robustness of classification. These findings challenge the prevailing assumption of a strong correlation between the two forms of robustness and open new pathways for understanding the relationship between the loss landscape and explanation loss.
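As a rough illustration of what an "input loss landscape with respect to explanation loss" refers to, the sketch below evaluates a loss on a 2-D slice through input space around one input. It is not the paper's implementation: the toy logistic model, the helper names (`gradient_x_input`, `explanation_loss`, `input_loss_landscape`), the mean-squared-error form of the explanation loss, and the use of the clean saliency map as the reference target are all assumptions made for illustration.

```python
import numpy as np

def gradient_x_input(w, b, x):
    """Gradient x Input saliency for a toy logistic model p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad = p * (1.0 - p) * w  # analytic dp/dx for the logistic model
    return grad * x

def explanation_loss(w, b, x, target):
    """Assumed form: mean squared error between the saliency map and a target map."""
    return float(np.mean((gradient_x_input(w, b, x) - target) ** 2))

def input_loss_landscape(loss_fn, x, radius=1.0, steps=11, seed=0):
    """Evaluate loss_fn on the slice x + a*d1 + c*d2 for two random unit directions."""
    rng = np.random.default_rng(seed)
    d1 = rng.standard_normal(x.shape)
    d2 = rng.standard_normal(x.shape)
    d1 /= np.linalg.norm(d1)
    d2 /= np.linalg.norm(d2)
    coords = np.linspace(-radius, radius, steps)
    grid = np.empty((steps, steps))
    for i, a in enumerate(coords):
        for j, c in enumerate(coords):
            grid[i, j] = loss_fn(x + a * d1 + c * d2)
    return coords, grid

# Hypothetical toy setup: random weights and a random 8-dimensional "image".
rng = np.random.default_rng(1)
w, b = rng.standard_normal(8), 0.1
x = rng.standard_normal(8)
target = gradient_x_input(w, b, x)  # assumption: clean explanation as the reference

coords, grid = input_loss_landscape(lambda z: explanation_loss(w, b, z, target), x)
flatness = grid.max() - grid.min()  # crude flatness proxy: smaller means flatter
```

The resulting `grid` can be passed to a surface or contour plot. The max-minus-min spread is only one crude way to summarize how flat the slice is, and two random directions show only one slice of a high-dimensional landscape.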

Paper Structure (22 sections, 5 equations, 8 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Loss Landscape Visualization
Methods
Experimental Results
Experimental Settings
Datasets
Model Architecture
Explanation Methods
Training Methods
Hyperparameters
Metrics
Separating Explanation and Classification Robustness
Influence of Different Explanation Methods in Training Phase
Influence of Different Explanation Methods in Testing Phase
...and 7 more sections

Figures (8)

Figure 1: (a) Illustration of an adversarial attack on explanation, in which the explanation map of the original image is manipulated toward a target, resulting in explanation loss. (b) A visualization of the input loss landscape w.r.t. classification loss, comparing a normally trained model with an adversarially trained model.
Figure 2: The explanations from different clusters generated by our clustering method on CIFAR10. Two images with different labels in the same cluster share a similar explanation, while both differ in explanation from the image in another cluster. The results show that our method can pick the most representative images w.r.t. explanation.
Figure 3: Input loss landscape w.r.t. explanation loss for models trained with different $\alpha$ in TRADES. The results show that this landscape does not differ clearly between models that vary in explanation robustness.
Figure 4: How our method influences the saliency maps calculated from Gradient x Input on CIFAR10. Intuitively, $SEP_{pos}$ makes the model consider more input pixels, adversarial training alone makes the model consider only a few input pixels, and $SEP_{neg}$ makes it consider even fewer input pixels than adversarial training. However, models trained with these three methods show the same classification robustness.
Figure 5: Performance of different explanation methods in the testing phase, w.r.t. explanation loss at the start, explanation loss at the end, and adversarial accuracy. Models are trained with Gradient x Input and tested with different explanation methods. All models are trained on CIFAR10. Even when the explanation methods used in training and testing differ, $SEP_{pos}$ shows a lower explanation loss than $SEP_{neg}$, while the two have similar adversarial accuracy.
...and 3 more figures

Theorems & Definitions (1)

Definition 1: Explanation Loss
