A Vulnerability of Attribution Methods Using Pre-Softmax Scores
Miguel Lerma, Mirtha Lucas
TL;DR
The paper addresses a vulnerability in attribution methods that rely on pre-softmax scores $z_i$ in CNN classifiers, showing that manipulating these logits can distort heatmaps without changing post-softmax predictions $y_i$. It presents a theoretical insight that the softmax is invariant to an additive scalar $t$ applied to all $z_i$, yet the gradients of $z_i$ with respect to inputs can change if $t$ depends on the input, enabling adversarial manipulation of attributions. A concrete proof-of-concept uses a VGG19-based model with $t = K \sum_k A_{00k}$ to distort Grad-CAM heatmaps computed from pre-softmax scores while leaving post-softmax heatmaps unaffected. The findings highlight a practical risk for explanations and robustness evaluations, suggesting that attribution methods based on post-softmax scores are more reliable and that broader defenses and evaluations are needed for attribution techniques.
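The shift-invariance argument can be checked numerically. The following is a minimal sketch, not the paper's VGG19 experiment: it assumes a toy linear classifier $z = Wx$ and a hypothetical input-dependent shift $t = K \sum_j x_j$ standing in for $t = K \sum_k A_{00k}$. The names `W`, `K`, `shifted_logits`, and `num_grad` are illustrative only. Gradients are estimated with central finite differences, so the sketch needs only NumPy.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # toy linear "classifier" (assumption): z = W @ x
x = rng.normal(size=4)        # toy input
K = 5.0                       # hypothetical scale of the shift

def logits(v):
    return W @ v

def shifted_logits(v):
    # Input-dependent scalar added to every logit (stand-in for K * sum_k A_00k).
    t = K * v.sum()
    return logits(v) + t

def num_grad(f, v, eps=1e-6):
    # Central finite-difference gradient of a scalar function f at v.
    g = np.zeros_like(v)
    for j in range(v.size):
        d = np.zeros_like(v)
        d[j] = eps
        g[j] = (f(v + d) - f(v - d)) / (2 * eps)
    return g

i = 0  # class whose score is being explained

# Post-softmax outputs are identical with and without the shift.
y_plain = softmax(logits(x))
y_shift = softmax(shifted_logits(x))

# Gradients of the pre-softmax score differ by dt/dx = K in every coordinate.
g_pre_plain = num_grad(lambda v: logits(v)[i], x)
g_pre_shift = num_grad(lambda v: shifted_logits(v)[i], x)

# Gradients of the post-softmax score are unaffected by the shift.
g_post_plain = num_grad(lambda v: softmax(logits(v))[i], x)
g_post_shift = num_grad(lambda v: softmax(shifted_logits(v))[i], x)

print("same predictions:      ", np.allclose(y_plain, y_shift))
print("pre-softmax grad diff: ", g_pre_shift - g_pre_plain)
print("same post-softmax grad:", np.allclose(g_post_plain, g_post_shift, atol=1e-5))
```

In this toy setting any gradient-based attribution built on $z_i$ inherits the extra term $\partial t / \partial x$, while one built on $y_i$ does not, which mirrors the behavior the summary describes for Grad-CAM.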
Abstract
We discuss a vulnerability involving a category of attribution methods used to provide explanations for the outputs of convolutional neural networks working as classifiers. It is known that networks of this type are vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on the effects that small modifications to the model may have on the attribution method without altering the model outputs.
