eXIAA: eXplainable Injections for Adversarial Attack

Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo

TL;DR

This paper shows that post-hoc XAI explanations can be adversarially manipulated without access to model weights. It introduces eXIAA, a black-box, model-agnostic, one-step attack that selects an attack image from the runner-up class, extracts the top-$k$ attribution features via a post-hoc method, and injects them into the target image with a weighted blend, aiming to maximize explanation disruption while preserving the predicted class and perceptual similarity. The approach is evaluated on ImageNet with ResNet-18 and ViT-B16 across saliency maps, Integrated Gradients, and DeepLIFT SHAP, showing substantial changes in explanations with minimal prediction change and high SSIM. The results reveal a critical vulnerability in current XAI methods, especially for transformer architectures, and motivate the development of more robust, trustworthy explainability techniques, potentially extending to other data modalities and multi-label tasks. Ethical considerations and reproducibility statements accompany the work, with code to be released upon acceptance.

Abstract

Post-hoc explainability methods are a subset of Machine Learning (ML) techniques that aim to explain why a model behaves in a certain way. In this paper, we present a new black-box, model-agnostic adversarial attack on post-hoc explainable Artificial Intelligence (XAI), particularly in the image domain. The goal of the attack is to modify the original explanations while remaining undetectable to the human eye and maintaining the same predicted class. In contrast to previous methods, we do not require any access to the model or its weights, but only to the model's computed predictions and explanations. Additionally, the attack is accomplished in a single step while significantly changing the provided explanations, as demonstrated by empirical evaluation. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in safety-critical applications. We systematically generate attacks based on the explanations produced by post-hoc explainability methods (saliency maps, integrated gradients, and DeepLIFT SHAP) for pretrained ResNet-18 and ViT-B16 on ImageNet. The results show that our attacks can lead to dramatically different explanations while leaving the predictive probabilities largely unchanged. We validate the effectiveness of our attack, quantify the induced change in the explanation with the mean absolute difference, and verify the closeness of the original image and the corrupted one with the Structural Similarity Index Measure (SSIM).
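The three checks named in the abstract (explanation change, image closeness, and class preservation) can be illustrated with a minimal NumPy sketch. All function names here are hypothetical, and `global_ssim` is a deliberate simplification: it applies the SSIM formula once over the whole image, whereas the standard SSIM averages the same quantity over local windows. The paper's exact normalization of the explanation change is not given in this section, so only the plain mean absolute difference is shown.

```python
import numpy as np


def mean_absolute_difference(expl_a, expl_b):
    """Mean absolute difference between two attribution maps of equal shape."""
    expl_a = np.asarray(expl_a, dtype=float)
    expl_b = np.asarray(expl_b, dtype=float)
    return float(np.mean(np.abs(expl_a - expl_b)))


def global_ssim(img_a, img_b, data_range=1.0):
    """Single-window SSIM (assumption: a simplification of windowed SSIM).

    Uses the usual stabilizing constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2,
    where L is the dynamic range of the pixel values.
    """
    img_a = np.asarray(img_a, dtype=float)
    img_b = np.asarray(img_b, dtype=float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = img_a.mean(), img_b.mean()
    var_a = np.mean((img_a - mu_a) ** 2)
    var_b = np.mean((img_b - mu_b) ** 2)
    cov = np.mean((img_a - mu_a) * (img_b - mu_b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


def class_preserved(probs_original, probs_corrupted):
    """True if the top-1 class is the same before and after the attack."""
    return int(np.argmax(probs_original)) == int(np.argmax(probs_corrupted))
```

A successful attack in the sense of the abstract would show a large `mean_absolute_difference` between the two explanations, an SSIM close to 1 between the two images, and `class_preserved(...)` returning `True`.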

Paper Structure

This paper contains 10 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: A scheme representing the structure of our adversarial attack on explanations. The classification probabilities of the original image (dog) are computed by the classifier; from these, an attack image (cat) from the runner-up class is selected. The same post-hoc explainability method used for the original image is applied to extract the positive attributions of the attack image. With the top-$k$ positive features selected, we mask the attack image ($\bigotimes$ symbol) and obtain the attack injections, which are combined with the original image through an $\alpha$-weighted sum ($\bigoplus$ symbol) to create the corrupted image. The final corrupted image is not distinguishable from the original one by the human eye, but leads to different explanations.
  • Figure 2: Percentage change of explanations with different $\alpha$. Each graph represents a pair of classifier and explainability method. The x-axis represents the top-$k$ while the y-axis represents the change in explanations. The graphs in the same row share the same y-axis scale. Each line represents the mean and standard deviation for a different value of $\alpha$, as indicated in the legend, and the corresponding dotted line represents the baseline performance for the same $\alpha$.
  • Figure 3: SSIM between the original image and the corrupted one. The structure follows that of Figure 2.
  • Figure 4: The absolute change in confidence for the predicted class of the original image vs the corrupted image. The structure follows that of Figure 2.
  • Figure 5: The figure compares the mean and standard deviation obtained when injecting the original image with an image from the runner-up class (solid line) vs the average over all the other classes (dotted lines). The lines have been computed on the CIFAR10 dataset, with a ResNet-18 and different explainability methods (saliency maps, integrated gradients, and DeepLIFT SHAP). The computation scales linearly in the number of classes; therefore, it is not practical to compute the same graph on ImageNet.
  • ...and 1 more figure
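
The pipeline described in the Figure 1 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the attribution map is assumed to have the same shape as the image, and the exact form of the $\alpha$-weighted sum is not given in this section. Here it is assumed to be a convex blend applied only at the masked positions, so that unmasked pixels stay untouched. In the black-box setting, the probabilities and the attribution map would come from the model's outputs rather than from its weights.

```python
import numpy as np


def runner_up_class(probs):
    """Index of the class with the second-highest predicted probability."""
    order = np.argsort(np.asarray(probs))[::-1]
    return int(order[1])


def top_k_positive_mask(attribution, k):
    """Boolean mask selecting the k largest strictly positive attributions."""
    attribution = np.asarray(attribution, dtype=float)
    flat = attribution.ravel()
    # Non-positive attributions can never be selected.
    scores = np.where(flat > 0, flat, -np.inf)
    k = min(k, int(np.sum(flat > 0)))
    mask = np.zeros(flat.shape, dtype=bool)
    if k > 0:
        mask[np.argpartition(scores, -k)[-k:]] = True
    return mask.reshape(attribution.shape)


def inject(original, attack, mask, alpha):
    """Blend the masked attack pixels into the original image.

    Assumption: corrupted = (1 - alpha) * original + alpha * attack
    on masked positions, and corrupted = original elsewhere.
    """
    original = np.asarray(original, dtype=float)
    injection = np.asarray(attack, dtype=float) * mask  # masking step
    blended = (1.0 - alpha) * original + alpha * injection  # weighted sum
    return np.where(mask, blended, original)
```

For example, with probabilities `[0.1, 0.6, 0.3]` the attack image would be drawn from class 2, and a small `alpha` keeps the corrupted image close to the original while still inserting the attack image's most salient features.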