A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions

Peiyu Yang; Naveed Akhtar; Jiantong Jiang; Ajmal Mian

TL;DR

This work tackles the challenge of evaluating attribution faithfulness when ground-truth attributions are absent. It introduces BackX, a backdoor-based XAI benchmark that uses controllable Trojaned models to derive verifiable ground-truth attributions and a standardized evaluation protocol that mitigates confounding factors. The authors provide theoretical arguments for BackX's superior fidelity, and perform extensive cross-domain benchmarking (vision and language) to reveal characteristic strengths and weaknesses of attribution methods, along with guidance for fair evaluation. They also show how attribution analysis can inform defenses against neural Trojans, highlighting practical implications for trustworthy and safe deployment of interpretable AI in high-stakes settings.

Abstract

Attribution methods compute importance scores for input features to explain model predictions. However, assessing the faithfulness of these methods remains challenging because ground-truth attributions for model predictions are unavailable. In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill, thereby facilitating a systematic assessment of attribution benchmarks. Next, we introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. We theoretically establish the superiority of our approach over existing benchmarks for well-founded attribution evaluation. With extensive analysis, we further establish a standardized evaluation setup that mitigates confounding factors such as post-processing techniques and the choice of explained predictions, thereby ensuring fair and consistent benchmarking. This setup is then employed for a comprehensive comparison of existing methods using BackX. Finally, our analysis also offers insights into defending against neural Trojans by utilizing attributions.

Paper Structure

This paper contains 41 sections, 2 theorems, 11 equations, 29 figures, and 8 tables.

Introduction
On Benchmark Fidelity
Fidelity Criteria
Fidelity Comparison
BackX Benchmark
Benchmark Framework
Evaluation Metrics
Benchmark Fidelity Examination
Standardized Attribution Evaluation
Post-processing Choice
Output Choice
Benchmarking
Backdoor Defense with Attributions
Conclusion
Appendix
...and 26 more sections

Key Result

Proposition 1

Let $f$ be a model that assigns the label $y$ to any input $x$ and to all inputs $\bar{x}$ within an $\epsilon$-ball under metric $\Omega$, i.e., $f(\bar{x}) = f(x)$ whenever $\Omega(x, \bar{x}) \leq \epsilon$. Suppose a Trojaned input is given by $\tilde{x} = x + v$, where $\Omega(x, \tilde{x}) \le

Figures (29)

Figure 1: Illustration of fidelity criteria. For explaining (a) the model, a faithful XAI benchmark should avoid (b) Functional Mapping Shift and (c) Input Distribution Shift, while ensuring (d) Attribution Verifiability and (e) Metric Sensitivity. See the Fidelity Criteria section for explanations.
Figure 2: The pipeline of BackX. Step 1 embeds a backdoor into a benign model by retraining it on a poisoned set. Step 2 uses the Trojaned model to generate predictions. Step 3 explains the predictions via attribution methods. Step 4 recovers a sample from a poisoned sample by replacing its pixels with those of a clean sample, as guided by the attribution mask. Step 5 assesses attribution methods.
Figure 3: Performance comparison of CAM-based, gradient-based and integration-based attribution methods on CIFAR-10, GTSRB and ImageNet using the BackX benchmark. (a) Difference in attack success rate between benchmarking the absolute values (abs.) and the original values (org.) of attributions. (b) Difference in trigger recall with and without taking absolute values.
Figure 4: Performance comparison of attribution methods for the output choice. (a) Difference in attack success rate when attributions are computed for softmax probabilities (prob.) versus logits. (b) Difference in trigger recall when using softmax probabilities versus logits.
Figure 5: Benchmarking of attribution methods using the attack success rate metric. Lower is better.
...and 24 more figures
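
Step 4 of the pipeline in Figure 2 and the trigger recall metric named in Figures 3 and 4 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names (`top_k_mask`, `recover_sample`, `trigger_recall`), the top-k thresholding of attributions into a mask, the flattened toy "image", and the definition of trigger recall as the fraction of trigger features covered by the mask. The attack success rate metric is not sketched because it requires a Trojaned model.

```python
from typing import List, Sequence


def top_k_mask(attribution: Sequence[float], k: int) -> List[bool]:
    """Mark the k features with the highest attribution scores (assumed thresholding)."""
    ranked = sorted(range(len(attribution)), key=lambda i: attribution[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(attribution))]


def recover_sample(
    poisoned: Sequence[float], clean: Sequence[float], mask: Sequence[bool]
) -> List[float]:
    """Replace the masked features of the poisoned sample with the clean values."""
    return [c if m else p for p, c, m in zip(poisoned, clean, mask)]


def trigger_recall(mask: Sequence[bool], trigger_mask: Sequence[bool]) -> float:
    """Fraction of trigger features covered by the attribution mask (assumed definition)."""
    trigger_total = sum(trigger_mask)
    if trigger_total == 0:
        raise ValueError("trigger_mask marks no features")
    hits = sum(1 for m, t in zip(mask, trigger_mask) if m and t)
    return hits / trigger_total


# Toy flattened "image" with six features; the trigger occupies the last two.
clean = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
poisoned = [0.1, 0.2, 0.3, 0.4, 1.0, 1.0]
trigger = [False, False, False, False, True, True]

# Hypothetical attribution scores that concentrate on the trigger features.
attribution = [0.05, 0.0, 0.1, 0.02, 0.9, 0.7]

mask = top_k_mask(attribution, k=2)
recovered = recover_sample(poisoned, clean, mask)
recall = trigger_recall(mask, trigger)
```

In this toy case the attribution ranks both trigger features highest, so the mask covers the whole trigger and the recovered sample matches the clean one. An attribution method that spreads its scores over non-trigger features would leave part of the trigger in place and score a lower recall.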

Theorems & Definitions (4)

Proposition 1
Proposition 2
Proof of Proposition 1
Proof of Proposition 2
