On the Evaluation Consistency of Attribution-based Explanations

Jiarui Duan; Haoling Li; Haofei Zhang; Hao Jiang; Mengqi Xue; Li Sun; Mingli Song; Jie Song

TL;DR

This work tackles the lack of consistent evaluation for attribution-based explanations in XAI by introducing Meta-Rank, an open benchmarking platform that jointly assesses eight attribution methods across four datasets and six CNN architectures using MoRF and LeRF protocols. Meta-Rank formalizes a cross-case leaderboard by computing pairwise method rankings over diverse test cases with a collective metric built from $P_{q \prec p}$ and its Logit transform to derive $\kappa$ differences, enabling robust, scalable comparisons. Through extensive experiments, it reveals that evaluation settings dramatically influence rankings, that rankings remain fairly stable across training checkpoints, and that previous cross-case approaches like ROAD fall short in heterogeneous scenarios. The results favor Input⊙Gradient and Integrated Gradients while highlighting Deconvolution as least reliable, and confirm the framework’s efficiency and flexibility for broad future use, with code available at the project site.
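As a rough illustration of the collective metric described above, the sketch below estimates $P_{q \prec p}$ as the fraction of test cases in which method $q$ is ranked ahead of method $p$, applies the logit transform to obtain a pairwise $\kappa$ difference, and averages those differences to get one $\kappa$ per method. The estimator, the clipping constant `eps`, the averaging step, the sign convention (larger $\kappa$ means better), and both function names are assumptions made for illustration; this summary does not give the paper's exact formulas.

```python
import math

def pairwise_precedence(rankings):
    """rankings[t][i] is the rank of method i in test case t (1 = best).

    Returns P where P[q][p] is the fraction of cases in which q is ranked
    ahead of p. Ties count as half a win (an assumption of this sketch).
    """
    tau = len(rankings)
    m = len(rankings[0])
    P = [[0.0] * m for _ in range(m)]
    for ranks in rankings:
        for q in range(m):
            for p in range(m):
                if ranks[q] < ranks[p]:
                    P[q][p] += 1.0 / tau
                elif ranks[q] == ranks[p]:
                    P[q][p] += 0.5 / tau
    return P

def meta_rank_scores(rankings, eps=1e-3):
    """Hypothetical fusion step: logit(P[q][p]) is read as kappa_q - kappa_p.

    Averaging row q of the logit matrix is the least-squares solution for
    kappa under a zero-sum constraint when all pairs are compared.
    """
    P = pairwise_precedence(rankings)
    m = len(P)
    kappa = []
    for q in range(m):
        diffs = []
        for p in range(m):
            # Clip so that a method that always wins does not give an infinite logit.
            prob = min(max(P[q][p], eps), 1.0 - eps)
            diffs.append(math.log(prob / (1.0 - prob)))
        kappa.append(sum(diffs) / m)
    return kappa

# Three test cases ranking three methods.
rankings = [[1, 2, 3], [1, 3, 2], [2, 1, 3]]
kappa = meta_rank_scores(rankings)
leaderboard = sorted(range(len(kappa)), key=lambda i: -kappa[i])
```

With these toy rankings, method 0 leads the leaderboard because it is ahead of each other method in most cases. The clipping constant controls how strongly a clean sweep is rewarded, so it is a tuning choice in this sketch.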

Abstract

Attribution-based explanations have garnered increasing attention and have emerged as the predominant approach towards eXplainable Artificial Intelligence (XAI). However, the absence of consistent configurations and systematic investigations in prior literature impedes comprehensive evaluations of existing methodologies. In this work, we introduce Meta-Rank, an open platform for benchmarking attribution methods in the image domain. Presently, Meta-Rank assesses eight exemplary attribution methods using six renowned model architectures on four diverse datasets, employing both the Most Relevant First (MoRF) and Least Relevant First (LeRF) evaluation protocols. Through extensive experimentation, our benchmark reveals three insights into attribution evaluation: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across test cases, the performance rankings are remarkably consistent across distinct checkpoints along the same training trajectory; 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets. Our findings underscore the necessity for future research in this domain to conduct rigorous evaluations encompassing a broader range of models and datasets, and to reassess the assumptions underlying the empirical success of different attribution methods. Our code is publicly available at https://github.com/TreeThree-R/Meta-Rank.
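To make the two evaluation protocols concrete, here is a minimal sketch of a pixel-ablation curve: MoRF removes the pixels with the highest attribution first, LeRF removes the lowest first, and the model score is recorded after each step. The function name, the flat-list image representation, the `predict` callable, the constant `fill` value used for removed pixels, and the single-image score are all simplifying assumptions; the benchmark itself reports accuracy over a dataset and may ablate features differently.

```python
def ablation_curve(predict, pixels, attribution, protocol="MoRF", steps=4, fill=0.0):
    """Ablate pixels in attribution order and record the score after each step.

    predict:     hypothetical callable mapping a list of pixel values to a score
    pixels:      flat list of pixel values
    attribution: flat list of relevance scores, one per pixel
    protocol:    "MoRF" (most relevant first) or "LeRF" (least relevant first)
    """
    if protocol not in ("MoRF", "LeRF"):
        raise ValueError("protocol must be 'MoRF' or 'LeRF'")
    # Sort pixel indices by relevance; descending for MoRF, ascending for LeRF.
    order = sorted(range(len(pixels)), key=lambda i: attribution[i],
                   reverse=(protocol == "MoRF"))
    ablated = list(pixels)
    scores = [predict(ablated)]            # score with nothing ablated
    chunk = -(-len(order) // steps)        # ceiling division
    for start in range(0, len(order), chunk):
        for i in order[start:start + chunk]:
            ablated[i] = fill              # replace the pixel with the fill value
        scores.append(predict(ablated))
    return scores

# Toy model whose score is the pixel sum, so the ideal attribution is the pixel value.
pixels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
morf = ablation_curve(sum, pixels, attribution=pixels, protocol="MoRF")
lerf = ablation_curve(sum, pixels, attribution=pixels, protocol="LeRF")
```

For a faithful attribution, the MoRF curve should fall quickly and the LeRF curve should stay high for as long as possible. In this toy case the score after the first step is lower under MoRF than under LeRF.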

Paper Structure (13 sections, 6 equations, 4 figures, 4 tables)

Introduction
Related Work
Preliminaries
Meta-Rank Settings and Benchmark
Standardized Settings
The Proposed Benchmark Metric
Main Results
RQ1: Consistency Investigation
RQ2: Necessity of Meta-Rank
RQ3: Attribution Evaluation with Meta-Rank
RQ4: Efficiency of Meta-Rank
Discussion
Conclusion

Figures (4)

Figure 1: Meta-Rank benchmark. (a) The benchmark has two stages: Test Case Generation and Meta-Rank. Test Case Generation: multiple factors (i.e., datasets, models, and evaluation protocols) are combined to generate $\tau$ different cases. Meta-Rank: (1) Case Execution. All $m$ competitors (i.e., attribution methods) are applied to these cases, resulting in a collection of rankings. The details of attribution evaluation on an individual case are provided in (b). (2) Ranking Fusion. All rankings $\{R_1, R_2, \ldots, R_\tau\}$ are then fed into this module. The comparison of two competitors is expressed as the differences in their rankings across all cases, which are then integrated and converted into the discrepancy in their Meta-Ranks. (3) Leaderboard. Finally, a unified leaderboard is obtained from the Meta-Ranks of the competitors. "$T$" is the test case, "$C$" is the competitor, and "$\kappa$" is the Meta-Rank value.
Figure 2: Evaluation results on 12 test cases. These cases combine two datasets: Food-101 (a)(b)(e)(f)(i)(j) and ImageNet-1k (c)(d)(g)(h)(k)(l); three models: ResNet-18 (a)(b)(c)(d), Inception-v4 (e)(f)(g)(h), and VGG-19 (i)(j)(k)(l); and two protocols: MoRF (a)(c)(e)(g)(i)(k) and LeRF (b)(d)(f)(h)(j)(l). "Baseline" represents the accuracy of the model when no features are ablated.
Figure 3: Spearman correlation among nine test cases (NWPU, Food, and ImageNet datasets paired with ResNet-18, Inception-v4, and VGG-19 models) in MoRF (a) and LeRF (b), and between MoRF and LeRF on the same nine cases (c). The labels along the horizontal arrow are equivalent to those along the vertical arrow. Here, we define a correlation score higher than 0.8 as strongly correlated, between 0.6 and 0.8 as moderately correlated, and lower than 0.6 as weakly correlated.
Figure 4: Time consumption of the "Ranking Fusion" module in Meta-Rank. $2$, $4$, $8$, $16$, $32$, and $64$ are the number of evaluated attribution methods. The gray bar represents the relative error.
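The correlation bands used in Figure 3 can be sketched as follows. The Spearman coefficient here uses the closed-form formula for tie-free rankings, $\rho = 1 - \frac{6 \sum_i d_i^2}{m(m^2 - 1)}$, where $d_i$ is the difference between the two ranks of method $i$. The function names are hypothetical, and treating the exact boundary values 0.6 and 0.8 as "moderate" is an assumption, since the caption does not say how the boundaries are handled.

```python
def spearman(rank_a, rank_b):
    """Spearman correlation between two tie-free rankings of the same m methods."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same methods")
    m = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (m * (m ** 2 - 1))

def band(rho):
    """Map a correlation score to the three bands named in the Figure 3 caption."""
    if rho > 0.8:
        return "strong"
    if rho >= 0.6:
        return "moderate"
    return "weak"

# Rankings of eight methods in three hypothetical test cases.
case_1 = [1, 2, 3, 4, 5, 6, 7, 8]
case_2 = [2, 1, 4, 3, 6, 5, 8, 7]   # neighbouring methods swapped
case_3 = [8, 7, 6, 5, 4, 3, 2, 1]   # fully reversed

labels = (band(spearman(case_1, case_2)), band(spearman(case_1, case_3)))
```

Swapping neighbouring methods keeps the two cases strongly correlated, while a full reversal gives $\rho = -1$ and falls in the weak band.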
