Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI

Christopher Burger, Charles Walter, Thai Le, Lingwei Chen

TL;DR

The paper studies stability estimation for text-based local surrogate explanations in XAI, focusing on how the choice of similarity measure for ranked explanations affects conclusions about robustness under adversarial perturbations. It systematically compares four measures (Jaccard, Kendall's Tau, Spearman's Footrule, and Rank-Biased Overlap (RBO)) and finds that some are excessively sensitive while others are overly coarse, either of which distorts the assessment of XAI stability. To address this, it introduces synonymity weighting through a function $\mathrm{Syn}(a,b)\in[0,1]$, together with a mapping between the features of two explanations, so that similarity computations reflect semantic closeness; the synonymity values are estimated from embeddings. Empirical validation on DistilBERT explanations for two text datasets shows that synonymity weighting reduces attack success for the sensitive measures (notably Jaccard and Spearman) and preserves robustness signals for RBO, offering practical guidance for robust XAI evaluation with minimal overhead.
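To make the compared measures concrete, the following is a minimal Python sketch of three of them (Jaccard, Spearman's Footrule, and a truncated form of RBO) applied to two ranked feature lists. It is an illustration under stated assumptions, not the paper's implementation: the function names, the convention of placing a missing feature one rank past the end of a list, and the choice `p = 0.9` are all hypothetical.

```python
from typing import Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Set overlap of two explanations; ignores rank order entirely."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def spearman_footrule(a: Sequence[str], b: Sequence[str]) -> int:
    """Sum of absolute rank displacements over all features in either list.

    Assumption: a feature missing from a list is assigned rank len(list),
    i.e. one position past the end of that list.
    """
    rank_a = {f: i for i, f in enumerate(a)}
    rank_b = {f: i for i, f in enumerate(b)}
    features = set(a) | set(b)
    return sum(
        abs(rank_a.get(f, len(a)) - rank_b.get(f, len(b))) for f in features
    )


def rbo_truncated(a: Sequence[str], b: Sequence[str], p: float = 0.9) -> float:
    """Prefix-truncated Rank-Biased Overlap.

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * |A[:d] & B[:d]| / d with
    k = min(len(a), len(b)). No extrapolation is applied, so two identical
    lists of length k score 1 - p**k rather than 1.
    """
    k = min(len(a), len(b))
    total = 0.0
    for d in range(1, k + 1):
        overlap = len(set(a[:d]) & set(b[:d]))
        total += (p ** (d - 1)) * overlap / d
    return (1.0 - p) * total


if __name__ == "__main__":
    original = ["good", "great", "movie"]
    perturbed = ["great", "good", "film"]
    print(jaccard(original, perturbed))
    print(spearman_footrule(original, perturbed))
    print(rbo_truncated(original, perturbed))
```

In this toy pair, replacing "movie" with "film" and swapping the top two features lowers every score, even though a human reader would consider the two explanations nearly equivalent. This is the kind of sensitivity the comparison is concerned with.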

Abstract

Recent work has investigated adversarial attacks on explainable AI (XAI) in the NLP domain, with a focus on the vulnerability of local surrogate methods such as LIME to adversarial perturbations, or small changes to the input of a machine learning (ML) model. In such attacks, the generated explanation is manipulated while the meaning and structure of the original input remain similar under the ML model. Such attacks are especially alarming when XAI is used as a basis for decision making (e.g., prescribing drugs based on AI medical predictors) or for legal action (e.g., legal disputes involving AI software). Although weaknesses have been shown to exist across many XAI methods, the reasons for these weaknesses remain little explored. Central to this XAI manipulation is the similarity measure used to calculate how one explanation differs from another. A poor choice of similarity measure can lead to erroneous conclusions about the stability or adversarial robustness of an XAI method. Therefore, this work investigates a variety of similarity measures designed for text-based ranked lists and referenced in related work, to determine their comparative suitability for use. We find that many measures are overly sensitive, resulting in erroneous estimates of stability. We then propose a weighting scheme for text-based data that incorporates the synonymity between the features within an explanation, providing more accurate estimates of the actual weakness of XAI methods to adversarial examples.
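The idea of weighting by synonymity can be sketched as follows. This is a hedged illustration only, not the weighting scheme defined in the paper: the toy synonymity table (standing in for embedding-based estimates), the greedy one-to-one mapping, and the soft-intersection formula are all assumptions introduced here, and every name in the block is hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

# Toy synonymity scores standing in for embedding-based estimates.
# The values are made up for illustration.
TOY_SYN: Dict[FrozenSet[str], float] = {
    frozenset({"movie", "film"}): 0.9,
    frozenset({"good", "great"}): 0.8,
}


def syn(a: str, b: str) -> float:
    """Hypothetical Syn(a, b) in [0, 1]; 1.0 for identical features."""
    if a == b:
        return 1.0
    return TOY_SYN.get(frozenset({a, b}), 0.0)


def greedy_mapping(
    xs: Sequence[str], ys: Sequence[str], syn_fn: Callable[[str, str], float]
) -> List[Tuple[str, str, float]]:
    """Greedily pair features one-to-one, highest synonymity first.

    Assumption: a greedy matching is used here for simplicity; it is not
    guaranteed to be the optimal assignment.
    """
    pairs = sorted(
        ((syn_fn(x, y), x, y) for x in set(xs) for y in set(ys)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_x, used_y = set(), set()
    mapping: List[Tuple[str, str, float]] = []
    for score, x, y in pairs:
        if score <= 0.0 or x in used_x or y in used_y:
            continue
        used_x.add(x)
        used_y.add(y)
        mapping.append((x, y, score))
    return mapping


def syn_weighted_jaccard(
    xs: Sequence[str],
    ys: Sequence[str],
    syn_fn: Callable[[str, str], float] = syn,
) -> float:
    """Jaccard with a soft intersection: matched pairs count by Syn score.

    With an exact-match syn_fn this reduces to the ordinary Jaccard index.
    """
    sx, sy = set(xs), set(ys)
    if not sx and not sy:
        return 1.0
    soft_inter = sum(score for _, _, score in greedy_mapping(xs, ys, syn_fn))
    return soft_inter / (len(sx) + len(sy) - soft_inter)


if __name__ == "__main__":
    original = ["good", "great", "movie"]
    perturbed = ["great", "good", "film"]
    print(syn_weighted_jaccard(original, perturbed))
```

For the toy lists above, plain Jaccard gives 0.5, while the weighted variant gives roughly 0.94 because "movie" and "film" are matched with score 0.9. Under this sketch, a synonym substitution no longer registers as a large change in the explanation.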

Paper Structure

This paper contains 27 sections, 7 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Successful attack rates under threshold $\tau$ for standard and synonymity-weighted explanations (blue: base measure; orange: synonymity-weighted measure).