Confident Teacher, Confident Student? A Novel User Study Design for Investigating the Didactic Potential of Explanations and their Impact on Uncertainty

Teodor Chiaburu, Frank Haußer, Felix Bießmann

TL;DR

Users' annotations are not significantly better after they have annotated with AI assistance, which suggests that explanations in visual human-AI collaboration do not appear to induce lasting learning effects.

Abstract

Evaluating the quality of explanations in Explainable Artificial Intelligence (XAI) remains a challenging problem, with ongoing debate in the research community. While some advocate for establishing standardized offline metrics, others emphasize the importance of human-in-the-loop (HIL) evaluation. Here we propose an experimental design to evaluate the potential of XAI in human-AI collaborative settings as well as the potential of XAI for didactics. In a user study with 1200 participants we investigate the impact of explanations on human performance on a challenging visual task: the annotation of biological species in complex taxonomies. Our results demonstrate the potential of XAI in complex visual annotation tasks: users become more accurate in their annotations and demonstrate less uncertainty with AI assistance. The increase in accuracy was, however, not significantly different when users were shown only the model's prediction compared to when an explanation was also provided. We also find negative effects of explanations: users tend to replicate the model's predictions more often when shown explanations, even when those predictions are wrong. When evaluating the didactic effects of explanations in collaborative human-AI settings, we find that users' annotations are not significantly better after performing annotation with AI assistance. This suggests that explanations in visual human-AI collaboration do not appear to induce lasting learning effects. All code and experimental data can be found in our GitHub repository: https://github.com/TeodorChiaburu/beexplainable.

Paper Structure (11 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Prototypical examples of the three wild bee species used for our HIL experiment. These examples were also shown to the participants in the instructions at the beginning of the trial. The three species are hard to tell apart because they differ only in morphological features of the thorax and abdomen: A. bicolor has a fuzzy orange thorax and a shiny brown abdomen; A. flavipes has a fuzzy brown thorax and a shiny brown abdomen; A. fulva's thorax and abdomen are both fuzzy orange.
  • Figure 2: Experimental Design. From left to right: Task 1 - users classify images on their own; Task 2 - further AI assistance is provided, presented differently depending on the assigned group; Task 3 - users again classify images alone.
  • Figure 3: AI assistance improves users' performance, but showing only AI predictions helps as much as showing explanations. Across all six human-AI collaboration conditions, mean and median user accuracy is higher in Task 2 (where AI hints were provided) than in Tasks 1 and 3 (both without AI assistance). Participants perform no better in Task 3 than in Task 1, suggesting that the assistive AI in Task 2 has no substantial didactic effect. There is also no noticeable difference between the two control groups and the four explanation groups, which suggests that explanations were not more effective than simply reporting the model's prediction.
  • Figure 4: Explanations improve user accuracy in human-AI collaboration (Task 2). Every dot represents one of the 45 images with corresponding user metrics averaged over all users' responses. (a) Task 2 (with AI assistance): Throughout all groups, samples classified with high certainty by the model were also classified with high accuracy by users. These samples are also associated with high acceptance rate for the AI suggestion. (b) Tasks 1+3 (without AI assistance): User accuracy is lower than in Task 2, regardless of whether the model was certain or not. The user-AI agreement rate is also notably lower.
  • Figure 5: Uncertainty of users' responses decreases when explanations are shown (Task 2) compared to no AI assistance (Tasks 1+3). Every dot represents one of the 45 images with corresponding user metrics averaged over all users' responses. (a) Task 2 (with AI assistance): Independent of model uncertainty, the users' uncertainty is near 0 for most samples. User-AI agreement is also high for these samples. (b) Tasks 1+3 (without AI assistance): Users are less certain than with AI assistance (Task 2), regardless of whether the model was certain or not. The acceptance rate of the AI's suggestions is also lower.
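
The per-image metrics behind Figures 4 and 5 (user accuracy and user-AI agreement, averaged over all users' responses to an image) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `per_image_metrics`, the record keys (`image_id`, `user_label`, `true_label`, `ai_label`) and the toy responses are assumptions made purely for illustration; the actual data format is defined in the repository linked in the abstract.

```python
from collections import defaultdict


def per_image_metrics(responses):
    """Average user accuracy and user-AI agreement per image.

    `responses` is an iterable of dicts with the hypothetical keys
    image_id, user_label, true_label and ai_label (one dict per user response).
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "agree": 0})
    for r in responses:
        c = counts[r["image_id"]]
        c["n"] += 1
        # A response is correct if it matches the ground-truth species.
        c["correct"] += r["user_label"] == r["true_label"]
        # A response agrees with the AI if it matches the model's suggestion.
        c["agree"] += r["user_label"] == r["ai_label"]
    return {
        image_id: {
            "accuracy": c["correct"] / c["n"],
            "agreement": c["agree"] / c["n"],
        }
        for image_id, c in counts.items()
    }


# Toy data (invented for illustration, not taken from the study).
toy_responses = [
    {"image_id": "img_01", "user_label": "fulva", "true_label": "fulva", "ai_label": "fulva"},
    {"image_id": "img_01", "user_label": "bicolor", "true_label": "fulva", "ai_label": "fulva"},
    {"image_id": "img_02", "user_label": "flavipes", "true_label": "bicolor", "ai_label": "flavipes"},
    {"image_id": "img_02", "user_label": "flavipes", "true_label": "bicolor", "ai_label": "flavipes"},
]

metrics = per_image_metrics(toy_responses)
for image_id, m in sorted(metrics.items()):
    print(image_id, m)
```

In this toy example the second image has full user-AI agreement but zero accuracy, which is the pattern the abstract describes as users replicating the model's prediction even when it is wrong.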