Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations
Supriya Manna, Niladri Sett
TL;DR
This paper introduces adversarial sensitivity as a necessary test for faithfulness in NLP explanations, arguing that faithful explainers should reflect changes in model reasoning when inputs are adversarially perturbed. It defines precise notions for adversarial examples, local explanations, and a robust distance metric based on incomplete rankings, and proposes three disjoint attack classes (word-level, character-level, and behavioral invariance) to probe explanations. An extensive experimental framework across SST-2, AG News, and Twitter Hate with DistilBERT and BERT shows that perturbation-based explainers (LIME, SHAP) and gradient-based variants (Grad × Input, IG × Input) often exhibit stronger adversarial sensitivity than vanilla gradients, while erasure-based metrics can diverge. The findings advocate adopting adversarial sensitivity as a foundational, model-agnostic tool for assessing explainers, with implications for reliable deployment and future work in multilingual and low-resource settings.
Abstract
Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques and, furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
