Study on the Helpfulness of Explainable Artificial Intelligence

Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing

TL;DR

This work assesses the helpfulness of XAI for human decision-making by measuring how well users perform a proxy task, designed so that good performance indicates that the explanation provides helpful information.

Abstract

Explainable Artificial Intelligence (XAI) is essential for building advanced machine learning-powered applications, especially in critical domains such as medical diagnostics or autonomous driving. Legal, business, and ethical requirements motivate the use of effective XAI, but the growing number of methods makes it challenging to pick the right ones. Further, because explanations are highly context-dependent, measuring the effectiveness of XAI methods without users reveals only a limited amount of information and misses human factors such as the ability to understand the explanations. We propose to evaluate XAI methods via the user's ability to successfully perform a proxy task, designed so that good performance indicates that the explanation provides helpful information. In other words, we address the helpfulness of XAI for human decision-making. Further, we conducted a user study on state-of-the-art methods, which showed differences in the methods' ability to generate trust and skepticism and in participants' ability to correctly judge whether an AI decision was right. Based on the results, we highly recommend using and extending this approach for more objective-based, human-centered user studies that measure XAI performance in an end-to-end fashion.

Paper Structure

This paper contains 18 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of the objective methodology for evaluating XAI. It starts with the classification of an image. The model makes a prediction that is hidden from the user and whose correctness is checked in the background. Then, a local post-hoc XAI method is applied to generate an explanation, which is presented to the user together with the input image and the ground truth. Based on this information, the user has to predict the model's prediction.
  • Figure 2: Example Survey Question
  • Figure 3: The mean accuracy of participants in determining the correctness of the AI's predictions varied by XAI method used. Confidence Scores consistently outperformed all other methods, while the remaining methods showed minimal to no differences in results.
  • Figure 4: Participants' mean sensitivity and specificity in judging whether the AI was correct in its prediction, per XAI method. SHAP performs best for sensitivity and worst of all methods for specificity. Setting SHAP aside, Confidence Scores perform best on both measures.
  • Figure 5: Right prediction based on a wrong feature
  • ...and 4 more figures
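
The measures named in the captions of Figures 3 and 4 (accuracy, sensitivity, and specificity per XAI method) can be illustrated with a minimal sketch. It assumes that the positive class is "the AI's prediction was correct", so that sensitivity is the share of correct AI predictions that participants accepted and specificity is the share of incorrect AI predictions that participants flagged. That convention, the `Response` record, the `score_by_method` function, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One participant answer (hypothetical record layout)."""
    method: str           # XAI method shown to the participant
    model_correct: bool   # whether the hidden model prediction was right
    judged_correct: bool  # whether the participant judged it as right


def score_by_method(responses):
    """Return accuracy, sensitivity and specificity for each XAI method.

    Assumption: the positive class is "the AI was correct".
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for r in responses:
        if r.model_correct:
            key = "tp" if r.judged_correct else "fn"
        else:
            key = "fp" if r.judged_correct else "tn"
        counts[r.method][key] += 1

    def ratio(numerator, denominator):
        # Undefined ratios (no cases of that kind) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    scores = {}
    for method, c in counts.items():
        total = sum(c.values())
        scores[method] = {
            "accuracy": ratio(c["tp"] + c["tn"], total),
            "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
            "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
        }
    return scores


# Toy data for illustration only; these are not results from the study.
toy_responses = [
    Response("method_a", True, True),
    Response("method_a", True, False),
    Response("method_a", False, False),
    Response("method_a", False, False),
    Response("method_b", True, True),
    Response("method_b", False, True),
]
scores = score_by_method(toy_responses)
for method, values in sorted(scores.items()):
    print(method, values)
```

Reporting sensitivity and specificity next to accuracy separates two failure modes that accuracy alone hides: a method that leads participants to accept nearly every prediction can score high on sensitivity while scoring low on specificity.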