A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions
Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian
TL;DR
This work tackles the challenge of evaluating attribution faithfulness when ground-truth attributions are absent. It introduces BackX, a backdoor-based XAI benchmark that uses controllable Trojaned models to derive verifiable ground-truth attributions, together with a standardized evaluation protocol that mitigates confounding factors. The authors provide theoretical arguments for BackX's superior fidelity and perform extensive cross-domain benchmarking (vision and language) to reveal characteristic strengths and weaknesses of attribution methods, along with guidance for fair evaluation. They also show how attribution analysis can inform defenses against neural Trojans, highlighting practical implications for the trustworthy and safe deployment of interpretable AI in high-stakes settings.
Abstract
Attribution methods compute importance scores for input features to explain model predictions. However, assessing the faithfulness of these methods remains challenging because ground-truth attributions for model predictions are unavailable. In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill, thereby facilitating a systematic assessment of attribution benchmarks. Next, we introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. We theoretically establish the superiority of our approach over existing benchmarks for well-founded attribution evaluation. Through extensive analysis, we further establish a standardized evaluation setup that mitigates confounding factors such as post-processing techniques and explained predictions, thereby ensuring fair and consistent benchmarking. This setup is then employed for a comprehensive comparison of existing methods using BackX. Finally, our analysis also offers insights into defending against neural Trojans by utilizing attributions.
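
To make the idea of a backdoor-derived ground truth concrete, the following is a minimal sketch, not the BackX protocol itself. It assumes that the region occupied by a backdoor trigger can be treated as the ground-truth attribution for a Trojaned prediction, and it scores an attribution map by the fraction of its absolute mass that falls inside that region. The function name `attribution_mass_in_trigger`, the toy 4x4 input, and the metric are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def attribution_mass_in_trigger(attribution, trigger_mask):
    """Fraction of absolute attribution mass inside the trigger region.

    Hypothetical metric for illustration only; not the BackX metric.
    """
    magnitude = np.abs(np.asarray(attribution, dtype=float))
    mask = np.asarray(trigger_mask, dtype=bool)
    total = magnitude.sum()
    if total == 0.0:
        # An all-zero attribution map assigns no mass anywhere.
        return 0.0
    return float(magnitude[mask].sum() / total)

# Toy 4x4 input with an assumed 2x2 trigger in the bottom-right corner.
trigger_mask = np.zeros((4, 4), dtype=bool)
trigger_mask[2:, 2:] = True

# An attribution map concentrated entirely on the trigger region.
focused = np.where(trigger_mask, 1.0, 0.0)
# An attribution map spread uniformly over all 16 features.
diffuse = np.ones((4, 4))

score_focused = attribution_mass_in_trigger(focused, trigger_mask)
score_diffuse = attribution_mass_in_trigger(diffuse, trigger_mask)
print(score_focused, score_diffuse)
```

Under this assumed metric, the map concentrated on the trigger scores higher than the uniformly spread map, which is the kind of ordering a ground-truth-based benchmark is meant to expose.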
