On the Evaluation Consistency of Attribution-based Explanations
Jiarui Duan, Haoling Li, Haofei Zhang, Hao Jiang, Mengqi Xue, Li Sun, Mingli Song, Jie Song
TL;DR
This work tackles the lack of consistent evaluation for attribution-based explanations in XAI by introducing Meta-Rank, an open benchmarking platform that jointly assesses eight attribution methods across four datasets and six CNN architectures under the MoRF and LeRF protocols. Meta-Rank builds a cross-case leaderboard from pairwise method rankings over diverse test cases: a collective metric based on $P_{q \prec p}$ and its logit transform yields differences in $\kappa$ between methods, enabling robust, scalable comparisons. Extensive experiments show that evaluation settings strongly influence rankings, that rankings remain fairly stable across training checkpoints, and that previous cross-case approaches such as ROAD fall short in heterogeneous scenarios. The results favor Input⊙Gradient and Integrated Gradients, identify Deconvolution as the least reliable method, and confirm the framework's efficiency and flexibility for broad future use. Code is available at the project repository.
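The pairwise aggregation mentioned above can be illustrated with a small sketch. It is not the Meta-Rank implementation: it assumes that $q \prec p$ means "method $p$ outperforms method $q$ in a test case", estimates $P_{q \prec p}$ as a smoothed frequency over test cases, and reads the logit of that probability as the gap $\kappa_p - \kappa_q$. The function names, the smoothing constant, and the toy rankings are all hypothetical.

```python
import math
from itertools import permutations

def outperform_prob(rankings, eps=0.5):
    """Estimate P_{q<p}: the fraction of test cases in which p is ranked ahead of q.

    rankings: list of per-test-case orderings of method names, best first.
    eps: additive smoothing (an assumption) so the logit stays finite.
    """
    methods = sorted(set(rankings[0]))
    probs = {}
    for q, p in permutations(methods, 2):
        wins = sum(r.index(p) < r.index(q) for r in rankings)
        probs[(q, p)] = (wins + eps) / (len(rankings) + 2 * eps)
    return probs

def logit(x):
    return math.log(x / (1.0 - x))

# Toy per-case rankings (made-up data, best method first).
rankings = [
    ["IG", "InputXGrad", "Deconv"],
    ["InputXGrad", "IG", "Deconv"],
    ["IG", "Deconv", "InputXGrad"],
]

probs = outperform_prob(rankings)
# Assumed reading: logit(P_{q<p}) plays the role of kappa_p - kappa_q.
kappa_gap = {pair: logit(v) for pair, v in probs.items()}
```

Under this reading the gaps are antisymmetric, so `kappa_gap[(q, p)]` is the negative of `kappa_gap[(p, q)]`, and a larger positive gap means that $p$ beats $q$ in more test cases.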
Abstract
Attribution-based explanations have garnered increasing attention and have emerged as the predominant approach towards \textit{eXplainable Artificial Intelligence}~(XAI). However, the absence of consistent configurations and systematic investigations in prior literature impedes comprehensive evaluation of existing methods. In this work, we introduce Meta-Rank, an open platform for benchmarking attribution methods in the image domain. Presently, Meta-Rank assesses eight exemplary attribution methods using six renowned model architectures on four diverse datasets, employing both the \textit{Most Relevant First} (MoRF) and \textit{Least Relevant First} (LeRF) evaluation protocols. Through extensive experimentation, our benchmark reveals three insights into attribution evaluation: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across test cases, the performance rankings exhibit remarkable consistency across distinct checkpoints along the same training trajectory; 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets. Our findings underscore the need for future research in this domain to conduct rigorous evaluations encompassing a broader range of models and datasets, and to reassess the assumptions underlying the empirical success of different attribution methods. Our code is publicly available at \url{https://github.com/TreeThree-R/Meta-Rank}.
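To make the MoRF and LeRF protocols concrete, the following minimal sketch removes features in order of attributed relevance and records the model output after each removal. It is a toy under stated assumptions, not the benchmark's code: the model is a random linear function on a flat 16-dimensional input rather than a CNN on images, removal means setting a feature to a zero baseline, and `perturbation_curve` is a hypothetical helper name.

```python
import numpy as np

def perturbation_curve(model, x, attribution, order="morf", baseline=0.0):
    """Model output after removing 0, 1, ..., n features in attribution order.

    order="morf": most relevant first; order="lerf": least relevant first.
    """
    idx = np.argsort(attribution)      # ascending: least relevant first
    if order == "morf":
        idx = idx[::-1]                # descending: most relevant first
    x_pert = x.astype(float).copy()
    scores = [model(x_pert)]
    for i in idx:
        x_pert[i] = baseline           # "remove" one feature (assumed zero baseline)
        scores.append(model(x_pert))
    return np.array(scores)

rng = np.random.default_rng(0)
w = rng.normal(size=16)
x = rng.normal(size=16)

def model(v):
    # Toy linear "classifier logit" (an assumption for illustration).
    return float(w @ v)

# For a linear model, Input⊙Gradient equals each feature's exact contribution.
attr = w * x

morf = perturbation_curve(model, x, attr, order="morf")
lerf = perturbation_curve(model, x, attr, order="lerf")
print("mean MoRF score:", morf.mean())
print("mean LeRF score:", lerf.mean())
```

For a faithful attribution, the MoRF curve should fall quickly while the LeRF curve stays high. In this linear toy the MoRF curve never exceeds the LeRF curve at any step, so a lower mean MoRF score and a higher mean LeRF score both indicate a better attribution.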
