Interpretation of Neural Networks is Susceptible to Universal Adversarial Perturbations

Haniyeh Ehsani Oskouie, Farzan Farnia

TL;DR

This work extends universal adversarial perturbations to neural network interpretation by targeting gradient-based saliency maps. It introduces two methods, UPI-Grad and UPI-PCA, that craft a single perturbation $\delta$ (with $\|\delta\|_2 \le \epsilon$) that maximally degrades the quality of explanations across diverse inputs, using a Gaussian-smoothed objective over $Z\sim\mathcal{N}(\mathbf{0},\sigma^2 I)$. The authors show that, to first order, the optimal perturbation aligns with the top singular vector of a gradient matrix $G$, which enables a PCA-based stochastic power method for computing it. Experiments on Tiny-ImageNet, CIFAR-10, and MNIST show that UPIs significantly distort gradient-based interpretations (e.g., $I_{SG}$ and $I_{IG}$) across models, often approach the impact of per-image attacks, and transfer across architectures, underscoring the vulnerability of explanations in adversarial settings.
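The singular-vector claim can be illustrated with a minimal sketch. It is not the authors' UPI-PCA algorithm: it runs plain (deterministic) power iteration on a small fixed matrix, whereas the paper describes a stochastic power method. The function names, the toy matrix, and the assumption that each row of `G` holds one sample's gradient-derived vector are all hypothetical.

```python
import math
import random


def top_right_singular_vector(G, iters=200, seed=0):
    """Power iteration on G^T G; returns a unit vector (sign is arbitrary)."""
    rng = random.Random(seed)
    n, d = len(G), len(G[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        # u = G v  (one entry per row of G)
        u = [sum(g * x for g, x in zip(row, v)) for row in G]
        # w = G^T u  (back to input dimension d)
        w = [sum(G[i][j] * u[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # G is zero along v; nothing to amplify
        v = [x / norm for x in w]
    return v


def universal_direction(G, epsilon):
    """Scale the top right singular vector to the L2 budget epsilon."""
    return [epsilon * x for x in top_right_singular_vector(G)]


# Toy matrix (assumption): rows stand in for per-sample gradient vectors.
G = [[3.0, 0.0], [0.0, 1.0], [3.0, 0.1]]
delta = universal_direction(G, epsilon=0.5)
```

In this toy case most of the rows' energy lies along the first coordinate, so the returned direction is dominated by that coordinate and has norm equal to the budget.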

Abstract

Interpreting neural network classifiers using gradient-based saliency maps has been extensively studied in the deep learning literature. While the existing algorithms manage to achieve satisfactory performance in application to standard image recognition datasets, recent works demonstrate the vulnerability of widely-used gradient-based interpretation schemes to norm-bounded perturbations adversarially designed for every individual input sample. However, such adversarial perturbations are commonly designed using the knowledge of an input sample, and hence perform sub-optimally in application to an unknown or constantly changing data point. In this paper, we show the existence of a Universal Perturbation for Interpretation (UPI) for standard image datasets, which can alter a gradient-based feature map of neural networks over a significant fraction of test samples. To design such a UPI, we propose a gradient-based optimization method as well as a principal component analysis (PCA)-based approach to compute a UPI which can effectively alter a neural network's gradient-based interpretation on different samples. We support the proposed UPI approaches by presenting several numerical results of their successful applications to standard image datasets.
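The gradient-based route mentioned in the abstract can be sketched as projected gradient ascent on a single perturbation shared by all samples. This is a hedged illustration under stated assumptions, not the paper's UPI-Grad procedure: the per-sample objective $(a_i^\top \delta)^2$ is a toy stand-in for an interpretation-distortion measure, and `project_l2`, `universal_pga`, `make_toy_grad`, the step size, and the iteration count are hypothetical choices.

```python
import math


def project_l2(delta, epsilon):
    """Project delta onto the L2 ball of radius epsilon."""
    norm = math.sqrt(sum(x * x for x in delta))
    if norm <= epsilon:
        return list(delta)
    return [epsilon * x / norm for x in delta]


def universal_pga(grad_fns, dim, epsilon, step=0.1, iters=100):
    """Ascend the average per-sample objective with one shared delta."""
    # Small nonzero start: zero is a stationary point of the toy objective.
    delta = [1e-3] * dim
    for _ in range(iters):
        avg = [0.0] * dim
        for grad in grad_fns:
            g = grad(delta)
            for j in range(dim):
                avg[j] += g[j] / len(grad_fns)
        stepped = [d + step * a for d, a in zip(delta, avg)]
        delta = project_l2(stepped, epsilon)
    return delta


def make_toy_grad(a):
    """Gradient of the toy per-sample objective (a . delta)^2."""
    def grad(delta):
        dot = sum(x * y for x, y in zip(a, delta))
        return [2.0 * dot * x for x in a]
    return grad


# Toy per-sample vectors (assumption), one gradient function per sample.
samples = [[3.0, 0.0], [0.0, 1.0], [3.0, 0.1]]
grads = [make_toy_grad(a) for a in samples]
delta = universal_pga(grads, dim=2, epsilon=0.5)
```

Averaging the gradients over samples before each step is what makes the perturbation universal rather than per-image; the projection keeps it inside the norm budget.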

Paper Structure

This paper contains 10 sections, 3 theorems, 19 equations, 10 figures, 2 tables, and 2 algorithms.

Key Result

Proposition 1

Consider the objective function in UPI Optimization: Universal Direction. Suppose that every summation term $\mathcal{D}(I(x),I(x+\delta))$ is $L$-Lipschitz in $\delta$. Then, assuming that $\tau:=\frac{L\sqrt{d}\epsilon^2}{\sigma} <\lambda$, for every $\Vert\delta\Vert_2\le\epsilon$ we will have

Figures (10)

  • Figure 1: Interpretation of VGG-16 on two Tiny-ImageNet samples before and after adding the perturbations.
  • Figure 2: Visualization of UPI perturbations in VGG-16 experiments. The top and bottom rows are the Tiny-ImageNet and CIFAR-10 UPIs.
  • Figure 3: Cross-correlation between generated UPIs in the Tiny-ImageNet experiments.
  • Figure 4: Cross-correlation between generated UPIs in the CIFAR-10 experiments.
  • Figure 5: Cross-correlation between generated UPIs in the MNIST experiments.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Lemma 1: Stein's lemma [landsman2008stein]