Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations

Benjamin Fresz; Lena Lörcher; Marco Huber

TL;DR

This work tackles the challenge of evaluating image saliency explanations without ground-truth pixel labels by introducing a mosaic-based framework that defines true/false feature importance for positives and negatives. It extends the Focus score with negative FI and additional metrics, and assesses metric validity through psychometric-inspired reliability analyses using Krippendorff's α and Spearman's ρ. Benchmarking across datasets (including Cars/Cats, Mountain Dogs, and ImageNet) and saliency methods (including SHAP and B-cos) reveals that method performance is highly dependent on the model and dataset, with no single approach excelling universally. The study provides open-source code, a rigorous reliability-oriented evaluation protocol, and practical guidance for selecting XAI methods in real-world use cases, while acknowledging limitations and encouraging further expansion of objective XAI metrics.

Abstract

The decision processes of computer vision models, especially deep neural networks, are opaque: humans cannot readily understand how these models arrive at their decisions. Over the last years, many methods for providing human-understandable explanations have therefore been proposed. For image classification, the most common group is saliency methods, which provide (super-)pixelwise feature attribution scores for input images. Evaluating them remains a problem, however, because their results cannot simply be compared to the unknown ground truth. To overcome this, a slew of proxy metrics has been defined; like the explainability methods themselves, these metrics are often built on intuition and are thus possibly unreliable. In this paper, new evaluation metrics for saliency methods are developed and common saliency methods are benchmarked on ImageNet. In addition, a scheme for evaluating the reliability of such metrics is proposed, based on concepts from psychometric testing. The code is available at https://github.com/lelo204/ClassificationMetricsForImageExplanations.
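To make the mosaic idea from the TL;DR concrete, the following is a minimal sketch of how classification-style scores could be computed from a signed saliency map on a mosaic. It is not the authors' implementation: the function names, the sign convention, and the exact formulas (precision as TP / (TP + FP), specificity as TN / (TN + FP), computed over summed attribution mass) are assumptions chosen for illustration. The repository linked above contains the actual definitions.

```python
import numpy as np


def mosaic_confusion(heatmap, target_mask):
    """Split signed attribution mass into TP/FP/TN/FN over a mosaic.

    heatmap: 2D array of signed feature importance (FI) scores.
    target_mask: 2D boolean array, True where a tile shows the target class.

    Assumed convention (illustrative, not taken from the paper):
    positive FI on target tiles counts as true positive, positive FI on
    other tiles as false positive, negative FI on other tiles as true
    negative, and negative FI on target tiles as false negative.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    target_mask = np.asarray(target_mask, dtype=bool)
    pos = np.clip(heatmap, 0.0, None)   # positive FI mass per pixel
    neg = np.clip(-heatmap, 0.0, None)  # magnitude of negative FI per pixel
    return {
        "tp": pos[target_mask].sum(),
        "fp": pos[~target_mask].sum(),
        "tn": neg[~target_mask].sum(),
        "fn": neg[target_mask].sum(),
    }


def precision(c):
    """Share of positive FI mass that falls on target-class tiles."""
    denom = c["tp"] + c["fp"]
    return c["tp"] / denom if denom > 0 else float("nan")


def specificity(c):
    """TN / (TN + FP); needs negative FI to be informative."""
    denom = c["tn"] + c["fp"]
    return c["tn"] / denom if denom > 0 else float("nan")


# Toy mosaic: left half shows the target class, right half another class.
heatmap = np.array([[3.0, 1.0, 1.0, -2.0],
                    [2.0, -1.0, 1.0, -1.0]])
mask = np.array([[True, True, False, False],
                 [True, True, False, False]])

c = mosaic_confusion(heatmap, mask)
print(precision(c))    # 0.75
print(specificity(c))  # 0.6
```

Under this convention, a method that outputs only positive FI always has a true-negative mass of zero, so its specificity carries no information. This matches the note in the Figure 3 caption that specificity can only be calculated for methods that provide negative FI.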

Paper Structure (31 sections, 11 figures, 7 tables)

Introduction
Related Work
Saliency Methods
XAI Evaluation
Methodology
Proposed Metrics
Mosaics
True and False Feature Importance
Evaluation Approach
Inter-rater Reliability
Inter-method Reliability
Datasets
Corner Cases with Small Datasets
Easy to Distinguish Classes
Difficult to Distinguish Classes
...and 16 more sections

Figures (11)

Figure 1: One sample mosaic for each of the regarded datasets (cf. the experiments section). On the left, the mosaic comprises the ImageNet classes "tabby" and "sports car"; in the middle, "Bernese Mountain Dog" and "Greater Swiss Mountain Dog"; and on the right, the classes "lorikeet", "mashed potato", and "American chameleon".
Figure 2: One sample heatmap by each saliency method for the first mosaic of the Cars/Cats dataset shown in Figure 1. The explanations are created for ResNet50 for the target class "tabby". The upper row shows heatmaps for methods providing positive and negative FI, the lower one for methods with only positive FI. LIME uses a binary mask to highlight relevant image pieces, so a binary masking of the original image is shown here. Similar results for VGG11 are presented in Figure 5 in the appendix.
Figure 3: Exemplary results for precision and specificity for ResNet50 on the datasets with easier and more difficult to distinguish classes. Higher values are better. Note that specificity can only be calculated for methods that provide negative FI.
Figure 4: Execution time of the different saliency methods in seconds, measured as the time required to generate the saliency maps of every mosaic in the ImageNet dataset (cf. the ImageNet subsection).
Figure 5: One sample heatmap by each saliency method for the first mosaic of the Cars/Cats dataset shown in Figure 1, here for VGG11. The upper row shows heatmaps for methods with positive and negative FI, the lower one for methods with only positive FI. LIME uses a binary mask to highlight relevant image pieces, so a binary masking of the original image is shown here. Note the differences from the explanations for the same image for ResNet50 in Figure 2.
...and 6 more figures
