
A Vulnerability of Attribution Methods Using Pre-Softmax Scores

Miguel Lerma, Mirtha Lucas

TL;DR

The paper addresses a vulnerability in attribution methods that rely on pre-softmax scores $z_i$ in CNN classifiers, showing that manipulating these logits can distort heatmaps without changing post-softmax predictions $y_i$. It presents a theoretical insight that the softmax is invariant to an additive scalar $t$ applied to all $z_i$, yet the gradients of $z_i$ with respect to inputs can change if $t$ depends on the input, enabling adversarial manipulation of attributions. A concrete proof-of-concept uses a VGG19-based model with $t = K \sum_k A_{00k}$ to distort Grad-CAM heatmaps computed from pre-softmax scores while leaving post-softmax heatmaps unaffected. The findings highlight a practical risk for explanations and robustness evaluations, suggesting that attribution methods based on post-softmax scores are more reliable and that broader defenses and evaluations are needed for attribution techniques.
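The invariance and the gradient change described above can be written out as a short worked derivation. This is a sketch rather than the paper's own presentation: it assumes the standard softmax $y_i = e^{z_i} / \sum_j e^{z_j}$, writes the shifted logits as $z'_i = z_i + t$, and uses $x$ for the input. The symbols $z'_i$, $y'_i$ and the spatial indices $(u, v)$ are notation introduced here for illustration.

$$
y'_i = \frac{e^{z_i + t}}{\sum_j e^{z_j + t}} = \frac{e^{t}\, e^{z_i}}{e^{t} \sum_j e^{z_j}} = y_i
$$

$$
\frac{\partial z'_i}{\partial x} = \frac{\partial z_i}{\partial x} + \frac{\partial t}{\partial x},
\qquad
\frac{\partial y'_i}{\partial x} = \frac{\partial y_i}{\partial x}
$$

The first line holds for any $t$, including one that depends on $x$, so the post-softmax outputs and their gradients are unchanged. The pre-softmax gradients pick up the extra term $\partial t / \partial x$. For the choice $t = K \sum_k A_{00k}$, and reading $A_{uvk}$ as the activation at spatial position $(u, v)$ in channel $k$ (an assumption about the indexing), that extra term is

$$
\frac{\partial t}{\partial A_{uvk}} =
\begin{cases}
K & \text{if } (u, v) = (0, 0), \\
0 & \text{otherwise.}
\end{cases}
$$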

Abstract

We discuss a vulnerability involving a category of attribution methods used to provide explanations for the outputs of convolutional neural networks working as classifiers. Networks of this type are known to be vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on the effects that small modifications to the model may have on the attribution method without altering the model outputs.

Paper Structure (7 sections, 6 equations, 5 figures)


Figures (5)

  • Figure 1: Structure of a typical classifier network. After a number of convolutional blocks this kind of network ends with a fully connected network producing a (pre-softmax) output z, followed by a softmax activation function with (post-softmax) output y.
  • Figure 2: Example of alteration of a classifier network that changes attributions based on pre-softmax scores without changing post-softmax scores.
  • Figure 3: Heatmaps produced by Grad-CAM using pre-softmax and post-softmax outputs respectively, intended to locate the position of the soccer ball. The original model is a VGG19 network pretrained on ImageNet. The altered model is the same VGG19 network, slightly modified but still functionally equivalent (same final outputs) to the original network. The heatmaps are computed at the last convolutional layer of each model. Note that Grad-CAM working on pre-softmax outputs has been tricked into producing wrong heatmaps. The heatmaps obtained using post-softmax outputs remain unchanged.
  • Figure 4: The altered model tends to produce the same heatmap regardless of the class assigned to the image. In this case Grad-CAM is used to locate a "maze" rather than a soccer ball in the image. The pre-softmax version of the heatmap on the altered model keeps highlighting the same upper left corner, while the other heatmaps focus on the lines drawn on the grass.
  • Figure 5: Another example showing the heatmap computed with pre-softmax outputs of the altered model concentrated in the upper left corner of the image. Heatmaps computed with post-softmax outputs remain unaltered, highlighting the position of the dog.
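
The mechanism behind these figures can be illustrated with a toy model in plain Python. This is a minimal sketch under stated assumptions, not the paper's VGG19 experiment and not a Grad-CAM implementation. The feature-map size, the random dense layer `WEIGHTS`, the value of `K`, and the helper names are all hypothetical. Gradients are estimated by central finite differences instead of an autodiff library. The sketch only checks that adding $t = K \sum_k A_{00k}$ to every logit leaves the softmax outputs and their gradients unchanged, while shifting the pre-softmax gradient at position $(0, 0)$ by $K$.

```python
import math
import random

H, W, C, N = 3, 3, 4, 5   # toy feature-map height, width, channels; number of classes
K = 10.0                  # hypothetical strength of the injected term t
random.seed(0)
# Hypothetical dense layer mapping the flattened feature map to N logits.
WEIGHTS = [[random.gauss(0, 1) for _ in range(H * W * C)] for _ in range(N)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits_original(a):
    return [sum(w * x for w, x in zip(row, a)) for row in WEIGHTS]


def logits_altered(a):
    # With row-major (u, v, k) flattening, the first C entries are A_{00k}.
    t = K * sum(a[:C])
    return [z + t for z in logits_original(a)]


def num_grad(f, a, eps=1e-5):
    """Central finite-difference gradient of the scalar function f at a."""
    grad = []
    for n in range(len(a)):
        hi = a[:n] + [a[n] + eps] + a[n + 1:]
        lo = a[:n] + [a[n] - eps] + a[n + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


a = [random.random() for _ in range(H * W * C)]  # toy flattened feature map
c = 2  # class whose score is being explained

y_orig = softmax(logits_original(a))
y_alt = softmax(logits_altered(a))

g_pre_orig = num_grad(lambda v: logits_original(v)[c], a)
g_pre_alt = num_grad(lambda v: logits_altered(v)[c], a)
g_post_orig = num_grad(lambda v: softmax(logits_original(v))[c], a)
g_post_alt = num_grad(lambda v: softmax(logits_altered(v))[c], a)

pre_shift = [x - y for x, y in zip(g_pre_alt, g_pre_orig)]
post_shift = [x - y for x, y in zip(g_post_alt, g_post_orig)]

print("max output difference:", max(abs(p - q) for p, q in zip(y_orig, y_alt)))
print("pre-softmax gradient shift at (0, 0):", pre_shift[:C])
print("largest pre-softmax shift elsewhere:", max(abs(s) for s in pre_shift[C:]))
print("largest post-softmax gradient shift:", max(abs(s) for s in post_shift))
```

Any attribution method that consumes the pre-softmax gradient inherits the extra term at position $(0, 0)$, whereas a method built on the post-softmax gradient sees no change. How that term shows up in a final heatmap depends on the attribution method and on how the model is altered, which this toy example does not model.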