Explainable but Vulnerable: Adversarial Attacks on XAI Explanation in Cybersecurity Applications

Maraz Mia, Mir Mehedi A. Pritom

TL;DR

This paper investigates the vulnerability of post-hoc XAI explanations to adversarial manipulation in cybersecurity settings. It formalizes fairwashing explanation (FE), manipulation explanation (ME), and backdoor-enabled manipulation (BD) attacks within a layered tactics, techniques, and procedures (TTP) framework, and evaluates six attack procedures across SHAP, LIME, and IG on four tabular cybersecurity datasets (phishing, IDS, malware, e-commerce). Across the case studies, FE attacks prove highly effective at hiding protected-feature importance, while ME and BD show more limited impact; some defenses exist (Scaffolding OOD, Biased Sampling) but gaps remain. The work underscores the need for resilient, multi-method defenses in XAI and provides a repository of methods and results to guide future defense research, with practical implications for cyber defense workflows.

Abstract

Explainable Artificial Intelligence (XAI) gives machine learning (ML) researchers the means to scrutinize the decisions of black-box models. XAI methods look deep inside a model's behavior and generate explanations that bring perceived trust and transparency. However, the level of trust varies with the specific XAI method. XAI methods can themselves fall victim to post-adversarial attacks that manipulate the expected outcome of the explanation module. Among such attack tactics, fairwashing explanation (FE), manipulation explanation (ME), and backdoor-enabled manipulation (BD) attacks are the notable ones. In this paper, we study the tactics, techniques, and procedures (TTPs) of these adversarial attacks on explanation alteration and their effect on the model's decisions. We explore six individual attack procedures on post-hoc explanation methods such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and IG (Integrated Gradients), and investigate these attacks in cybersecurity application scenarios such as phishing, malware, intrusion, and fraudulent website detection. Our experimental study reveals the actual effectiveness of these attacks, underscoring the urgency of enhancing the resiliency of XAI methods and their applications.
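
To make the fairwashing idea more concrete, the following is a minimal sketch of a scaffolding-style construction, assuming the general pattern in which an adversarial wrapper sends real-looking inputs to a biased model and sends the perturbed samples that LIME- or SHAP-style explainers generate to an innocuous one. It is not the authors' implementation: the nearest-neighbour OOD check, the threshold, the toy data, and every function name here are hypothetical choices for illustration only.

```python
import math
import random

def make_ood_detector(train_rows, threshold):
    """Flag a row as out-of-distribution (OOD) when its nearest
    training row is farther away than `threshold` (assumed heuristic)."""
    def is_ood(row):
        nearest = min(math.dist(row, t) for t in train_rows)
        return nearest > threshold
    return is_ood

def biased_predict(row):
    # Hypothetical biased classifier: decides on the protected feature (index 0) only.
    return int(row[0] > 0.5)

def unbiased_predict(row):
    # Hypothetical innocuous classifier: ignores the protected feature.
    return int(row[1] > 0.0)

def scaffolded_predict(row, is_ood):
    # Real-looking inputs get the biased decision;
    # perturbed (OOD) inputs get the innocuous one.
    return unbiased_predict(row) if is_ood(row) else biased_predict(row)

# Toy data: a binary protected feature plus one continuous feature.
rng = random.Random(0)
real_rows = [(float(rng.randint(0, 1)), rng.gauss(0.0, 1.0)) for _ in range(200)]

# Stand-in for explainer perturbations: Gaussian noise added to every feature.
perturbed_rows = [(a + rng.gauss(0.0, 0.5), b + rng.gauss(0.0, 0.5))
                  for a, b in real_rows]

is_ood = make_ood_detector(real_rows, threshold=0.05)

real_out = [scaffolded_predict(r, is_ood) for r in real_rows]
pert_flags = [is_ood(r) for r in perturbed_rows]
pert_out = [scaffolded_predict(r, is_ood) for r in perturbed_rows]
```

On the real rows the wrapper reproduces the biased decisions exactly, while the perturbed rows flagged as OOD receive outputs that do not depend on the protected feature. An explainer that queries the wrapper mostly on such perturbed rows would therefore tend to under-report the protected feature's importance.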

Paper Structure

This paper contains 49 sections, 10 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: TTP layers for XAI explanation attacks
  • Figure 2: Overview of our proposed methodology
  • Figure 3: A successful Fairwashed Explanation for the XGB model on the phishing dataset under the Output Shuffling attack
  • Figure 4: Performance of the models other than the baseline - UN_* are the unbiased models trained without the protected feature, OOD_* are the adversarial OOD models
  • Figure 5: Effective Fairwashed Explanation in both LIME and SHAP for the e-commerce dataset and SMLP model under the Scaffolding OOD attack (overall rank changed for the protected feature - red bar)
  • ...and 4 more figures