Fooling SHAP with Output Shuffling Attacks
Jun Yuan, Aritra Dasgupta
TL;DR
The paper introduces data-agnostic shuffling attacks that manipulate SHAP-based explanations without requiring training data access, proving that Shapley values cannot detect such attacks while SHAP estimators may, depending on the implementation. It formulates three attack classes—Dominance, Mixing, and Swapping—along with relaxation and data-access variants, and analytically derives how these attacks affect feature attributions under linear and nonlinear models. Through experiments on Graduate Admissions, Diabetes Risk, and German Credit datasets, the authors show that SHAP detection is uneven and can be bypassed in practice, while fairness metrics reveal substantial degradation in group fairness under several attack configurations. The work highlights a vulnerability in SHAP-based auditing and motivates the development of robust defenses, alternative explainers, and auditing frameworks to ensure reliable fairness assessments in socio-technical settings.
Abstract
Explainable AI (XAI) methods such as SHAP can help discover feature attributions in black-box models. If the method reveals a significant attribution from a "protected feature" (e.g., gender, race) on the model output, the model is considered unfair. However, adversarial attacks can subvert detection by XAI methods. Previous approaches to constructing such an adversarial model require access to the underlying data distribution, which may not be possible in many practical scenarios. We relax this constraint and propose a novel family of attacks, called shuffling attacks, that are data-agnostic. The proposed attack strategies can adapt any trained machine learning model to fool Shapley value-based explanations. We prove that Shapley values cannot detect shuffling attacks. However, algorithms that estimate Shapley values, such as linear SHAP and SHAP, can detect these attacks with varying degrees of effectiveness. We demonstrate the efficacy of the attack strategies by comparing the performance of linear SHAP and SHAP using real-world datasets.
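To make concrete the quantity these attacks target, the following minimal sketch computes exact Shapley values by enumerating feature subsets. It does not reproduce the paper's shuffling attacks or any SHAP estimator. The function `shapley_values`, the scorer `toy_model`, its weights, and the choice to replace absent features with a single fixed baseline are all illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x (hypothetical helper).

    Assumption: a feature that is "absent" from a coalition is replaced
    by its value in a single fixed baseline point.
    """
    n = len(x)

    def value(subset):
        # Evaluate f with features in `subset` taken from x, others from baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy scorer: feature 0 plays the role of the protected feature.
def toy_model(z):
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[2]


x = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
# For this linear scorer, each value equals weight_i * (x_i - baseline_i),
# up to floating-point rounding, and the values sum to f(x) - f(baseline).
```

In an audit of the kind the abstract describes, a large attribution for feature 0 would flag the scorer as relying on the protected feature. Enumeration costs a number of model evaluations exponential in the number of features, so it is only practical for very small feature sets.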
