Improving Stability Estimates in Adversarial Explainable AI through Alternate Search Methods
Christopher Burger, Charles Walter
TL;DR
This paper addresses the stability of text-based XAI explanations by assessing how few perturbations can meaningfully alter an explanation while preserving the model's output. It proposes a genetic algorithm (GA) as an alternative to a greedy baseline for efficiently locating minimum perturbations under semantic similarity constraints and top-k feature preservation. Using ranking-based explanation similarity measures, including Rank-biased Overlap variants, Jaccard, Kendall's tau, and Spearman footrule, the study demonstrates that the GA can reveal more precise bounds on explanation instability and can reduce the perturbation count in practice. Experiments on the Twitter GB and S2D text datasets show that the GA sometimes outperforms the greedy approach, that LIME explanations remain vulnerable to perturbation, and that the search is computationally costly, underscoring the need for scalable, robust XAI evaluation methods.
Abstract
Advances in the effectiveness of machine learning models have come at the cost of enormous complexity, resulting in a poor understanding of how they function. Local surrogate methods have been used to approximate the workings of these complex models, but recent work has revealed their vulnerability to adversarial attacks in which the explanation produced is appreciably different while the meaning and structure of the complex model's output remain similar. This prior work has focused on the existence of these weaknesses but not on their magnitude. Here we explore using an alternate search method with the goal of finding minimum viable perturbations: the fewest perturbations necessary to achieve a fixed similarity value between the explanations of the original and altered text. Intuitively, an explainability method that requires fewer perturbations to expose a given level of instability is inferior to one that requires more. This nuance allows for superior comparisons of the stability of explainability methods.
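To make the notion of a minimum viable perturbation concrete, the following is a minimal sketch under stated assumptions, not the search used in the paper. It uses brute-force enumeration (neither the greedy baseline nor the genetic algorithm) and top-k Jaccard similarity as the explanation similarity measure. The names `toy_explain`, `WEIGHTS`, `SUBSTITUTES`, and `minimum_viable_perturbations` are hypothetical, the toy explainer stands in for a real local surrogate such as LIME, and the constraints on semantic similarity and on preserving the model's output are omitted.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# An explainer maps a token sequence to its features ranked by importance.
Explainer = Callable[[Sequence[str]], List[str]]


def jaccard_top_k(a: Sequence[str], b: Sequence[str], k: int) -> float:
    """Jaccard similarity between the top-k features of two ranked lists."""
    set_a, set_b = set(a[:k]), set(b[:k])
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def minimum_viable_perturbations(
    tokens: Sequence[str],
    substitutes: Dict[str, str],
    explain: Explainer,
    threshold: float,
    k: int = 3,
) -> Optional[Tuple[int, List[str]]]:
    """Return the smallest number of substitutions (and the perturbed text)
    whose explanation similarity to the original is at most `threshold`.

    Exhaustive search over subsets in order of increasing size; this is
    exponential and only feasible for toy inputs.
    """
    base = explain(tokens)
    candidates = [i for i, tok in enumerate(tokens) if tok in substitutes]
    for n in range(1, len(candidates) + 1):
        for positions in combinations(candidates, n):
            perturbed = list(tokens)
            for i in positions:
                perturbed[i] = substitutes[tokens[i]]
            if jaccard_top_k(base, explain(perturbed), k) <= threshold:
                return n, perturbed
    return None  # the target similarity is unreachable with these substitutes


# Hypothetical importance weights standing in for a surrogate explanation.
WEIGHTS = {"great": 0.9, "film": 0.8, "turn": 0.6, "plot": 0.5,
           "movie": 0.3, "twist": 0.2, "story": 0.1}
# Hypothetical synonym table; a real attack would check semantic similarity.
SUBSTITUTES = {"movie": "film", "plot": "story", "twist": "turn"}


def toy_explain(tokens: Sequence[str]) -> List[str]:
    """Rank tokens by descending (assumed) importance weight."""
    return sorted(tokens, key=lambda tok: -WEIGHTS.get(tok, 0.0))


if __name__ == "__main__":
    text = ["great", "movie", "plot", "twist"]
    print(minimum_viable_perturbations(text, SUBSTITUTES, toy_explain, 0.25))
```

In this toy setting no single substitution lowers the top-3 Jaccard similarity below 0.5, so reaching a threshold of 0.25 requires two perturbations. An explainer for which one substitution sufficed would, in the sense described above, be the less stable of the two.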
