
Explainable AI improves task performance in human-AI collaboration

Julian Senoner, Simon Schallmoser, Bernhard Kratzwald, Stefan Feuerriegel, Torbjørn Netland

TL;DR

This work hypothesizes that augmenting humans with explainable AI improves task performance in human–AI collaboration and implements explainable AI in the form of visual heatmaps in inspection tasks conducted by domain experts.

Abstract

Artificial intelligence (AI) provides considerable opportunities to assist human work. However, one crucial challenge of human-AI collaboration is that many AI algorithms operate in a black-box manner, where how the AI arrives at its predictions remains opaque. This makes it difficult for humans to validate a prediction made by AI against their own domain knowledge. For this reason, we hypothesize that augmenting humans with explainable AI as a decision aid improves task performance in human-AI collaboration. To test this hypothesis, we analyze the effect of augmenting domain experts with explainable AI in the form of visual heatmaps. We then compare participants who were supported by either (a) black-box AI or (b) explainable AI, where the latter helps them follow AI predictions when the AI is accurate and overrule the AI when its predictions are wrong. We conducted two preregistered experiments with representative, real-world visual inspection tasks from manufacturing and medicine. The first experiment was conducted with factory workers from an electronics factory, who performed $N=9{,}600$ assessments of whether electronic products have defects. The second experiment was conducted with radiologists, who performed $N=5{,}650$ assessments of chest X-ray images to identify lung lesions. The results of our experiments with domain experts performing real-world tasks show that task performance improves when participants are supported by explainable AI instead of black-box AI. For example, in the manufacturing setting, we find that augmenting participants with explainable AI (as opposed to black-box AI) leads to a five-fold decrease in the median error rate of human decisions, a significant improvement in task performance.

Paper Structure (30 sections, 2 equations, 14 figures, 23 tables)


Figures (14)

  • Figure 1: Overview of the experiments for assessing the effect of explainable AI on task performance. (A) Experimental design of the manufacturing experiment where factory workers were asked to approve images of faultless products and to reject images of defective products through a computer interface. (B) Experimental design of the medical experiment where radiologists were asked to decide whether lung lesions are visible in the chest X-ray image. In both experiments, participants were randomly assigned to one of the two treatments: (a) black-box AI or (b) explainable AI.
  • Figure 2: Results of the manufacturing experiment. The boxplots compare the task performance between the two treatments: black-box AI and explainable AI. The task performance is measured by the balanced accuracy (A) and the defect detection rate (B) based on the quality assessment of workers and the ground-truth labels of the product images. A balanced accuracy of 50% provides a naïve baseline corresponding to a random guess (black dotted line). The standalone AI algorithm attains a balanced accuracy of 95.6% and a defect detection rate of 92.9% (orange dashed lines). Statistical significance is based on a one-sided Welch's $t$-test (***$P<0.001$, **$P<0.01$, *$P<0.05$). In the boxplots, the center line denotes the median; box limits are the upper and lower quartiles; whiskers extend to 1.5× the interquartile range.
  • Figure 3: Results of the medical experiment. The boxplots compare the task performance between the two treatments: black-box AI and explainable AI. The task performance is measured by the balanced accuracy (A) and the disease detection rate (B) based on the assessments of radiologists and the ground-truth labels of the chest X-ray images. A balanced accuracy of 50% provides a naïve baseline corresponding to a random guess (black dotted line). The standalone AI algorithm attains a balanced accuracy of 82.2% and a disease detection rate of 71.4% (orange dashed lines). Statistical significance is based on a one-sided Welch's $t$-test (***$P<0.001$, **$P<0.01$, *$P<0.05$). In the boxplots, the center line denotes the median; box limits are the upper and lower quartiles; whiskers extend to 1.5× the interquartile range.
  • Figure S1: Four types of electronic products (printed circuit boards). (A-D) Exemplary images of faultless products that were inspected during the experiment.
  • Figure S2: Examples of quality defects. (A) Example of a defective product with wrong components. (B) Example of a defective product with a component assembled in the wrong orientation. (C) Example of a defective product with a faulty component.
  • ...and 9 more figures
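
The captions of Figures 2 and 3 refer to balanced accuracy, a detection rate, and a one-sided Welch's $t$-test. The following is a minimal sketch of how such quantities could be computed, not the authors' analysis code. It assumes that label 1 marks a defective product (or a visible lesion) and 0 a faultless one, and that the detection rate is the share of truly defective items that were flagged. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (assumed coding: 1 = defect, 0 = faultless)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def detection_rate(y_true, y_pred):
    """Share of truly defective items that were flagged (sensitivity)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; a random guess yields about 0.5."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for mean(a) - mean(b)."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy data, invented for illustration only (not taken from the experiments).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)
dr = detection_rate(y_true, y_pred)

# Hypothetical per-participant balanced accuracies in the two treatments.
explainable = [0.90, 0.95, 1.00]
black_box = [0.80, 0.85, 0.90]
t_stat, dof = welch_t(explainable, black_box)
print(ba, dr, t_stat, dof)
```

The sketch stops at the test statistic because the standard library has no $t$-distribution. If SciPy is available, a one-sided $P$ value for the hypothesis that the explainable-AI group performs better could be obtained with `scipy.stats.ttest_ind(explainable, black_box, equal_var=False, alternative="greater")`.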