Human-Centered Evaluation of XAI Methods

Karam Dawoud; Wojciech Samek; Peter Eisert; Sebastian Lapuschkin; Sebastian Bosse

TL;DR

The paper tackles the challenge of making AI explanations interpretable to end users by directly comparing three local XAI methods (ProtoPNet, Occlusion, LRP) against human-generated baselines in a controlled image-recognition task. Using a ClickMe-based human baseline and a dataset of 102 images, the study shows that while the highlighted regions differ across methods, the resulting interpretability is broadly similar for humans, and no single method consistently outperforms others. ProtoPNet offers higher classification accuracy but requires more image exposure, whereas LRP provides efficient, informative explanations with lower exposure; Occlusion lags in recognition performance but offers simple, fast explanations. The findings support a human-centered approach to XAI evaluation and suggest adaptive method selection or recommender systems to tailor explanations to users and tasks, with implications for deploying transparent AI in practice.

Abstract

In the ever-evolving field of Artificial Intelligence, a critical challenge has been to decipher the decision-making processes within the so-called "black boxes" in deep learning. Over recent years, a plethora of methods have emerged, dedicated to explaining decisions across diverse tasks. Particularly in tasks like image classification, these methods typically identify and emphasize the pivotal pixels that most influence a classifier's prediction. Interestingly, this approach mirrors human behavior: when asked to explain our rationale for classifying an image, we often point to the most salient features or aspects. Capitalizing on this parallel, our research embarked on a user-centric study. We sought to objectively measure the interpretability of three leading explanation methods: (1) Prototypical Part Network, (2) Occlusion, and (3) Layer-wise Relevance Propagation. Intriguingly, our results highlight that while the regions spotlighted by these methods can vary widely, they all offer humans a nearly equivalent depth of understanding. This enables users to discern and categorize images efficiently, reinforcing the value of these methods in enhancing AI transparency.

Paper Structure (8 sections, 10 figures, 4 tables)

Introduction
Related Work
Experiment Setup
Evaluation experiment-Specific
Primary findings
Correlation between the explanation methods and ClickMe
Individualized XAI Methods
Discussion

Figures (10)

Figure 2: Left: the autocomplete list drops down and updates dynamically with each letter typed. The participant can confirm the choice with the Enter key or a mouse click. Right: feedback confirms that the answer was correct and states how many rounds have been completed.
Figure 3: Simplified examples of ClickMe relevance heatmaps with a bubble size of $3 \times 3$. Left to right: in the first image, four bubbles are arranged in a "circle", with the center being the most important. In the second image, five bubbles are placed along a line from the top left to the bottom right, spaced one square apart (the center of the "line" is more relevant). In the third image, four bubbles are placed between the bottom left and the top right; some "salt and pepper" artifacts appear due to fast movement or a long sampling time.
Figure 4: Violin plots of the pixel-count distribution for each method at each revelation step. ClickMe pixel counts vary from image to image, whereas the counts for LRP and Occlusion are bounded below by 10 percent of the entire image at the last revelation step; they follow the ClickMe count whenever it exceeds that 10 percent threshold. For ProtoPNet, the count at the last revelation step is set to one-sixth of the image.
Figure 5: Left: an example image of a ferret is uncovered using the different methods; since the ClickMe pixel count is below 10 percent of the image, LRP and Occlusion uncover up to their lower bound, and ProtoPNet always uncovers up to one-sixth of the image. Right: an image of a water jug is uncovered using the different methods, where the number of pixels uncovered by ClickMe exceeds 10 percent of the image. LRP and Occlusion follow the uncovering curve of ClickMe, while ProtoPNet always uncovers up to one-sixth of the image.
Figure 6: Probability density, per method, of the number of pixels needed for image recognition. ClickMe, Occlusion, and LRP are similarly right-skewed, meaning that the most important features, revealed toward the beginning of each round, are indeed enough for participants to recognize the correct class. ProtoPNet is only somewhat right-skewed and more smoothly spread. This can be explained by the upscaling of the activation map, which may place the center of the activation outside the actual object, so that more revelation steps (more pixels) are needed before the object itself is uncovered.
...and 5 more figures
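
The pixel-budget rule described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `final_reveal_budget`, the integer rounding, and the example image size are assumptions made here. Only the 10 percent lower bound for LRP and Occlusion, the "follow ClickMe when it exceeds the bound" behavior, and the one-sixth budget for ProtoPNet come from the captions.

```python
def final_reveal_budget(method: str, total_pixels: int, clickme_pixels: int) -> int:
    """Pixels uncovered at the last revelation step (illustrative sketch).

    method         -- "ClickMe", "LRP", "Occlusion", or "ProtoPNet"
    total_pixels   -- number of pixels in the image
    clickme_pixels -- number of pixels the ClickMe baseline marks for this image
    """
    if method == "ClickMe":
        # The human baseline uncovers whatever participants marked.
        return clickme_pixels
    if method in ("LRP", "Occlusion"):
        # Lower bound of 10 percent of the image; follow ClickMe above it.
        # Rounding down is an assumption of this sketch.
        lower_bound = total_pixels // 10
        return max(lower_bound, clickme_pixels)
    if method == "ProtoPNet":
        # Fixed budget of one-sixth of the image, independent of ClickMe.
        return total_pixels // 6
    raise ValueError(f"unknown method: {method!r}")


# Hypothetical image with 60,000 pixels.
ferret_like = final_reveal_budget("LRP", 60_000, 3_000)       # ClickMe below the bound
jug_like = final_reveal_budget("Occlusion", 60_000, 9_000)    # ClickMe above the bound
proto = final_reveal_budget("ProtoPNet", 60_000, 9_000)
```

In the first call the ClickMe count is below the 10 percent bound, so the bound applies (the ferret case in Figure 5); in the second it is above, so the method follows ClickMe (the water jug case).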
