Revealing Vulnerabilities of Neural Networks in Parameter Learning and Defense Against Explanation-Aware Backdoors

Md Abdul Kadir, GowthamKrishna Addluri, Daniel Sonntag

TL;DR

The paper proposes an evaluation-time defence against most modern explanation-aware adversarial attacks. Across three types of attack, it reduces the Attack Success Rate (ASR) by approximately 99% and the Mean Square Error (MSE) between the original explanation and the defended (post-attack) explanation by approximately 91%.

Abstract

Explainable Artificial Intelligence (XAI) strategies play a crucial part in increasing the understanding and trustworthiness of neural networks. Nonetheless, these techniques can generate misleading explanations. Blinding attacks can drastically alter a machine learning algorithm's prediction and explanation, providing misleading information by adding visually unnoticeable artifacts to the input while maintaining the model's accuracy. This poses a serious challenge to the reliability of XAI methods. To address this challenge, we leverage statistical analysis to highlight the changes in CNN weights following blinding attacks. We introduce a method specifically designed to limit the effectiveness of such attacks during the evaluation phase, avoiding the need for extra training. The proposed method defends against most modern explanation-aware adversarial attacks, achieving a decrease of approximately 99% in the Attack Success Rate (ASR) and a reduction of approximately 91% in the Mean Square Error (MSE) between the original explanation and the defended (post-attack) explanation across three unique types of attacks.
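The two headline metrics can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, and the success criterion used here (a triggered input counts as a successful attack when the model predicts the attacker's target label) is an assumption, since this section does not state how success is defined for each attack type. Explanations are assumed to be heatmaps of equal shape.

```python
import numpy as np


def attack_success_rate(predictions, target_label):
    """Fraction of triggered inputs classified as the attacker's target label.

    Assumption: success is defined on the predicted label only.
    """
    predictions = np.asarray(predictions)
    return float(np.mean(predictions == target_label))


def explanation_mse(original, other):
    """Mean squared error between two explanation heatmaps of equal shape."""
    original = np.asarray(original, dtype=float)
    other = np.asarray(other, dtype=float)
    if original.shape != other.shape:
        raise ValueError("explanations must have the same shape")
    return float(np.mean((original - other) ** 2))


def relative_reduction(before, after):
    """Relative decrease of a metric, e.g. 0.99 for a 99% reduction."""
    return (before - after) / before


# Toy data (made up for illustration, not results from the paper).
attacked_preds = [7, 7, 7, 3]   # predictions on triggered inputs, attacked model
defended_preds = [1, 2, 7, 3]   # predictions on the same inputs after the defence
asr_before = attack_success_rate(attacked_preds, target_label=7)   # 0.75
asr_after = attack_success_rate(defended_preds, target_label=7)    # 0.25
print(relative_reduction(asr_before, asr_after))
```

The same `relative_reduction` helper applies to the MSE figure: compute `explanation_mse` between the original and the attacked explanation, then between the original and the defended explanation, and compare the two values.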

Paper Structure (20 sections, 12 equations, 24 figures, 8 tables)

Figures (24)

  • Figure 1: Examples of attacks and defenses on the Grad-CAM explainer. The Triggered column shows examples of three attack methods, Simple Fooling (SF), Red Herring (RH), and Full Disguise (FD), and the Defense column shows examples of the corresponding defenses.
  • Figure 2: Examples of the Simple Fooling (SF), Red Herring (RH), and Full Disguise (FD) attack profiles, shown in that order from left to right. Each left sub-column shows the regular prediction and explanation when the input contains no trigger, that is, the normal behaviour of the un-attacked model. Each right sub-column shows the result when a square trigger is present in the input. In an SF attack, only the explanation is targeted. In an RH attack, both the prediction and the explanation are changed to their targets. In an FD attack, by contrast, only the prediction is targeted, while the explanation remains consistent with that of the un-attacked model. An attack can produce any pattern in the explanation; for simplicity, we attack the model to generate a square box as the targeted explanation.
  • Figure 3: The CKA correlation between layers of the original models and the attacked models. The top sub-plot shows the CKA scores for models that contain BN layers. The bottom left sub-plot shows CKA scores between models that do not have trainable BN parameters. The bottom right sub-plot shows CKA scores between models that lack any BN layers. We observe that when BN layers are used with parameter learning, the model's core weights exhibit a higher CKA correlation with the original weights than in models that either have BN with no trainable parameters or lack BN entirely.
  • Figure 4: The activation of the final convolutional layer with BN is shown in (a), alongside the activation after BN is replaced with CFN. The targeted explanation is evident when the BN parameters learned during the attack are applied; substituting CFN for BN eliminates the attacker's artefacts.
  • Figure 5: Distribution of Spearman's Rank Correlation (SRC) between the original model's explanations and the attacked, then defended, explanations after SF and RH attacks on both datasets. The Defense column shows a higher correlation between the attacked model's explanations and the original explanations once the defense is applied.
  • ...and 19 more figures
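The captions for Figures 3 and 5 rely on two similarity measures, CKA and Spearman's Rank Correlation. The sketch below shows one common way to compute each with NumPy. It is illustrative only: the captions do not say whether the paper uses linear or kernel CKA, so the linear variant is an assumption, and all function names are hypothetical. Inputs to `linear_cka` are assumed to be matrices with one row per example; inputs to `spearman_rank_correlation` are assumed to be two explanation heatmaps with the same number of elements.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows.

    Assumption: the linear variant; a kernel variant would differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    # Centre each feature (column) before comparing.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman_rank_correlation(a, b):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    ranks_a = average_ranks(a)
    ranks_b = average_ranks(b)
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Toy usage with random data (not data from the paper).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))
print(linear_cka(features, features))  # identical representations score 1 up to rounding
print(spearman_rank_correlation(rng.normal(size=(7, 7)), rng.normal(size=(7, 7))))
```

Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, which makes it suitable for comparing layers whose units are not aligned. Spearman's correlation depends only on the ordering of heatmap values, so it is insensitive to monotonic rescaling of an explanation.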