On the Connection Between Adversarial Robustness and Saliency Map Interpretability

Christian Etmann, Sebastian Lunz, Peter Maass, Carola-Bibiane Schönlieb

TL;DR

The study investigates why adversarially robust neural networks often exhibit more interpretable saliency maps. By formalizing robustness as the distance to the decision boundary ($\rho(x)$) and interpretability via gradient-based alignment ($\alpha(x)$), the authors derive exact results for linear models and develop a linearized robustness framework ($\tilde{\rho}(x)$) for non-linear networks, complemented by a homogeneous decomposition of neural networks. They prove bounds linking robustness and alignment and validate them through experiments on MNIST and ImageNet using local Lipschitz regularization and multiple adversarial attacks, finding that the robustness-interpretability link is stronger in more linear regimes. The findings clarify when and why alignment correlates with robustness, and suggest directions for defenses that leverage saliency alignment alongside adversarial robustness.

Abstract

Recent studies on the adversarial vulnerability of neural networks have shown that models trained to be more robust to adversarial attacks exhibit more interpretable saliency maps than their non-robust counterparts. We aim to quantify this behavior by considering the alignment between input image and saliency map. We hypothesize that as the distance to the decision boundary grows, so does the alignment. This connection is strictly true in the case of linear models. We confirm these theoretical findings with experiments based on models trained with a local Lipschitz regularization and identify where the non-linear nature of neural networks weakens the relation.
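
The linear case mentioned in the abstract can be illustrated with a minimal sketch. It assumes a binary linear classifier without bias, $F(x) = \mathrm{sgn}(\langle x, z \rangle)$, and it assumes the alignment is measured as $|\langle x, \nabla\Psi(x) \rangle| / \|\nabla\Psi(x)\|$; this page does not state the formula, so treat that choice as an assumption. The helper names (`dot`, `norm`, `robustness_linear`, `alignment`) and the numbers are hypothetical.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean norm."""
    return math.sqrt(dot(u, u))

def robustness_linear(x, z):
    """Distance from x to the decision boundary {y : <y, z> = 0}."""
    return abs(dot(x, z)) / norm(z)

def alignment(x, grad):
    """Assumed alignment measure: |<x, grad>| / ||grad||."""
    return abs(dot(x, grad)) / norm(grad)

# Hypothetical weight vector and input point.
z = [3.0, 4.0]
x = [2.0, 1.0]

# For the score Psi(x) = <x, z>, the gradient with respect to x is z itself,
# so the saliency map of the linear model is the weight vector.
grad = z

rho = robustness_linear(x, z)
alpha = alignment(x, grad)
print(rho, alpha)
```

Under these assumptions both quantities reduce to the same expression, which is one way to read the claim that the connection holds strictly for linear models. Adding a bias term or using a non-linear score function breaks the identity.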

Paper Structure

This paper contains 15 sections, 12 theorems, 47 equations, 10 figures.

Key Result

Lemma 1

Let $F$ be a classifier with locally affine score function $\Psi$. Assume $l(x) \geq \rho(x)$. Then for ${i^\ast}:=F(x)$ the predicted class at $x$.

Figures (10)

  • Figure 1: An image of a dog (left), the saliency maps of a highly non-adversarially-robust neural network (middle) and of a more robust network (right). We observe that the robust network gives a much clearer indication of what the classifier deems to be discriminative features. Details about saliency and the robustification are given in the experiments section. Most figures are best viewed on a screen.
  • Figure 2: The median alignment increases with the median robustness of the model on ImageNet. Furthermore, the more elaborate attacks consistently find smaller adversarial perturbations than the simple gradient attack. The linearized robustness estimator provides a rather realistic estimation of the algorithmically calculated robustness.
  • Figure 3: Similar to Figure 2, the median alignment increases with the median robustness of the model on MNIST. Towards the end, some saturation effects are visible.
  • Figure 4: The pointwise relationship between $\tilde{\rho}(x)$ and $\alpha(x)$, exemplified on a model trained on ImageNet (left) and MNIST (right). While the two properties are well-correlated on MNIST (fitting the 'averaged' view from Figure 3), there is no visible correlation in the case of ImageNet.
  • Figure 5: The pointwise relationship between $\tilde{\rho}(x)$ and $\rho(x)$, each calculated for 1000 validation points on a model trained on ImageNet (left) and MNIST (right). $\rho(x)$ was approximately calculated using the CW-attack. In both cases, the correlation is high.
  • ...and 5 more figures

Theorems & Definitions (24)

  • Definition 1
  • Definition 2: Alignment
  • Definition 3: Alignment, Multi-Class Case
  • Lemma 1
  • Definition 4: Linearized Robustness
  • Lemma 2: Linearized Robustness of Homogeneous Classifiers
  • Theorem 1: Homogeneous Decomposition of Neural Networks
  • Theorem 2
  • Theorem 3
  • Lemma 2
  • ...and 14 more
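
Definition 4 above names a linearized robustness $\tilde{\rho}(x)$, but this page does not reproduce its formula. The sketch below assumes the common first-order form: the smallest ratio, over all classes $j \neq i^\ast$, of the score gap $\Psi^{i^\ast}(x) - \Psi^{j}(x)$ to the norm of the gradient difference $\nabla\Psi^{i^\ast}(x) - \nabla\Psi^{j}(x)$. The function name `linearized_robustness` and the three-class affine model are hypothetical.

```python
import math

def linearized_robustness(scores, grads):
    """Assumed form: min over j != i* of (score gap) / ||gradient gap||,
    where i* is the index of the largest score (the predicted class)."""
    i_star = max(range(len(scores)), key=lambda i: scores[i])
    best = math.inf
    for j in range(len(scores)):
        if j == i_star:
            continue
        gap = scores[i_star] - scores[j]
        diff = [a - b for a, b in zip(grads[i_star], grads[j])]
        # Note: this divides by zero if two classes share the same gradient.
        best = min(best, gap / math.sqrt(sum(d * d for d in diff)))
    return best

# Hypothetical affine score function Psi(x) = W x + b with three classes.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [3.0, 1.0]

scores = [sum(w_k * x_k for w_k, x_k in zip(row, x)) + b_i
          for row, b_i in zip(W, b)]

# For an affine model, the gradient of each score is the matching row of W.
rho_tilde = linearized_robustness(scores, W)
print(rho_tilde)
```

In this toy example the predicted class is 0 and the nearest competing class is 1, so the result equals the distance from `x` to the line where the first two scores tie. For a globally affine score function the estimate coincides with the true distance to the decision boundary; for a real network it is only a local approximation.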