Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations

Supriya Manna; Niladri Sett

TL;DR

This paper introduces adversarial sensitivity as a necessary test for faithfulness in NLP explanations, arguing that faithful explainers should reflect changes in model reasoning when inputs are adversarially perturbed. It defines precise notions for adversarial examples, local explanations, and a robust distance metric based on incomplete rankings, and proposes three disjoint attack classes (word-level, character-level, and behavioral invariance) to probe explanations. An extensive experimental framework across SST-2, AG News, and Twitter Hate with DistilBERT and BERT shows that perturbation-based explainers (LIME, SHAP) and gradient-based variants (Grad × Input, IG × Input) often exhibit stronger adversarial sensitivity than vanilla gradients, while erasure-based metrics can diverge. The findings advocate adopting adversarial sensitivity as a foundational, model-agnostic tool for assessing explainers, with implications for reliable deployment and future work in multilingual and low-resource settings.

Abstract

Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer's response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques and quantifies faithfulness from a crucial yet underexplored paradigm.
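
The core idea, comparing an explainer's output before and after an adversarial perturbation, can be illustrated with a minimal sketch. The paper's own distance metric over incomplete rankings is not specified in this summary, so the sketch substitutes a plain Jaccard distance over top-k token sets as a stand-in. The function names (`top_k_tokens`, `jaccard_distance`, `adversarial_sensitivity`), the choice of k, and the dictionary representation of attributions are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def top_k_tokens(attributions: Dict[str, float], k: int) -> List[str]:
    """Return the k tokens with the largest absolute attribution.

    Assumption: each token appears once, so a dict is enough. Ties are
    broken alphabetically to keep the result deterministic.
    """
    ranked = sorted(attributions.items(), key=lambda kv: (-abs(kv[1]), kv[0]))
    return [token for token, _ in ranked[:k]]


def jaccard_distance(a: List[str], b: List[str]) -> float:
    """Jaccard distance between two token lists, treated as sets.

    This is a stand-in for a distance over incomplete rankings: it
    tolerates items that appear in only one list, but ignores order.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def adversarial_sensitivity(
    original: Dict[str, float], adversarial: Dict[str, float], k: int = 3
) -> float:
    """Distance between top-k explanations of the original and attacked input.

    Higher values mean the explanation changed more under the attack.
    """
    return jaccard_distance(top_k_tokens(original, k), top_k_tokens(adversarial, k))
```

Under this sketch, an explainer whose top-k tokens barely move when an attack flips the model's prediction would score near 0, which is the behavior the paper argues a faithful explainer should not show. A real evaluation would also need to account for rank order and for tokens that the attack itself rewrites, which set overlap alone does not capture.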
