Fooling SHAP with Output Shuffling Attacks

Jun Yuan; Aritra Dasgupta

TL;DR

The paper introduces data-agnostic shuffling attacks that manipulate SHAP-based explanations without requiring training data access, proving that Shapley values cannot detect such attacks while SHAP estimators may, depending on the implementation. It formulates three attack classes—Dominance, Mixing, and Swapping—along with relaxation and data-access variants, and analytically derives how these attacks affect feature attributions under linear and nonlinear models. Through experiments on Graduate Admissions, Diabetes Risk, and German Credit datasets, the authors show that SHAP detection is uneven and can be bypassed in practice, while fairness metrics reveal substantial degradation in group fairness under several attack configurations. The work highlights a vulnerability in SHAP-based auditing and motivates the development of robust defenses, alternative explainers, and auditing frameworks to ensure reliable fairness assessments in socio-technical settings.

Abstract

Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a "protected feature" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert the detection of XAI methods. Previous approaches to constructing such an adversarial model require access to the underlying data distribution, which may not be possible in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP using real-world datasets.

Paper Structure (16 sections, 16 equations, 4 figures, 4 algorithms)

Introduction
Related Work
Attack Strategies
Shapley values
Adversarial shuffling
Shuffling Attack
Techniques to modify the attacks
Estimating SHAP's detection capability
Experiments
Datasets
Graduate admissions prediction
Diabetes prediction
Credit prediction
Discussion
Conclusion and Future Work
...and 1 more sections

Figures (4)

Figure 1: Attack effect on ground truth data for admission prediction. The yellow and green lines indicate students with and without research experience, respectively. The yellow group's scores all dropped from the base function $f$ under the different attacks (Swapping, Mixing, Dominance).
Figure 2: Admission prediction experiment. For the three scoring features (GRE, TOEFL, University Rating), the attributions detected by the three explainers (LIME, SHAP, and linearSHAP) are similar. For the protected feature (Research), LIME detects more attribution as the attack parameters increase. A higher parameter indicates more distortion of the score vector output by the original function. For the Swapping attack, the parameter is the percentile of the score vector within which swapping is conducted repeatedly to keep students with Research above students without. For the Mixing attack, the parameter is the probability of bias towards students with Research.
Figure 3: Diabetes prediction experiment. Each column shows the SHAP results and fairness measurements for one attack. An attack is defined by the protected features used in its construction (gender, or both gender and age) and by which attack algorithm (Dominance, Mixing, Swapping, or no attack) is applied to the top and bottom halves of the input data.
Figure 4: German credit experiment. The first row shows that with no attack, the loan rate has top-1 importance for 100% of the data instances. The last row shows that when the attack is applied to all input data, the gender feature has top-1 importance for 67.1% of instances, top-2 for 7.4%, and top-3 for 13.4%. When the attack is applied to the top 15% of the data, the gender feature appears least often among the top-3 most important features.
