Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting

Maxime Kayser, Bayar Menzat, Cornelius Emde, Bogdan Bercean, Alex Novak, Abdala Espinosa, Bartlomiej W. Papiez, Susanne Gaube, Thomas Lukasiewicz, Oana-Maria Camburu

TL;DR

Text-based explanations lead to significant over-reliance, which is alleviated by combining them with saliency maps. The quality of explanations, that is, how much factually correct information they entail and how much this aligns with AI correctness, significantly impacts the usefulness of the different explanation types.

Abstract

The growing capabilities of AI models are leading to their wider use, including in safety-critical domains. Explainable AI (XAI) aims to make these models safer to use by making their inference process more transparent. However, current explainability methods are seldom evaluated in the way they are intended to be used: by real-world end users. To address this, we conducted a large-scale user study with 85 healthcare practitioners in the context of human-AI collaborative chest X-ray analysis. We evaluated three types of explanations: visual explanations (saliency maps), natural language explanations, and a combination of both modalities. We specifically examined how different explanation types influence users depending on whether the AI advice and explanations are factually correct. We find that text-based explanations lead to significant over-reliance, which is alleviated by combining them with saliency maps. We also observe that the quality of explanations, that is, how much factually correct information they entail, and how much this aligns with AI correctness, significantly impacts the usefulness of the different explanation types.

Paper Structure

This paper contains 45 sections, 5 equations, 25 figures, 2 tables.

Figures (25)

  • Figure 1: The flow of the user study that every participant goes through.
  • Figure 2: (a) Revealing ($C_{AI} = 0$, low $C_\chi$): The AI incorrectly suggests atelectasis, but the poorly rated explanations help clinicians identify the error, leading to higher accuracy compared to relying on the AI prediction alone. (b) Confusing ($C_{AI} = 1$, low $C_\chi$): The AI correctly identifies aspiration but provides low $C_\chi$ explanations, leading to lower diagnostic accuracy compared to the No XAI setting. (c) Misleading ($C_{AI} = 0$, high $C_\chi$): The AI incorrectly suggests alveolar haemorrhage but provides highly rated explanations, misleading participants to agree with the incorrect AI when explanations are provided. (d) Convincing ($C_{AI} = 1$, high $C_\chi$): The AI correctly identifies pneumonia and provides highly rated explanations, resulting in high diagnostic accuracy, especially for NLEs.
  • Figure 3: Human accuracy given $C_{AI}$ and $C_\chi$, predicted with the model in \ref{eq:model-definition}.
  • Figure 4: Five attributes of explainability methods, ranked on a 7-point Likert scale.
  • Figure 5: The bar charts represent model-based predictions of human accuracy under different conditions. For example, the model predicts a 76.5% "expected probability" of correct user decisions for "insightful explanations" with NLEs (top-left plot). $p$-values are derived from hypothesis tests comparing human accuracy between explanation types for specific data subsets. The error bars represent standard errors. Significance markers: $\cdot$ ($p<0.1$), * ($p<0.05$), ** ($p<0.01$).
  • ...and 20 more figures
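
The four scenarios named in the Figure 2 caption form a 2x2 grid over AI correctness ($C_{AI}$) and explanation correctness ($C_\chi$). A minimal Python sketch of that mapping follows. The function name, the numeric scale for $C_\chi$, and the 0.5 cut-off separating "low" from "high" are assumptions made for illustration; this excerpt does not state how the paper scores or thresholds $C_\chi$.

```python
def explanation_scenario(ai_correct: bool,
                         explanation_correctness: float,
                         threshold: float = 0.5) -> str:
    """Map (C_AI, C_chi) to the scenario labels used in the Figure 2 caption.

    ASSUMPTION: explanation_correctness is a score where larger means more
    factually correct, and `threshold` (hypothetical, default 0.5) splits
    "low" from "high". Neither is specified in the source excerpt.
    """
    high_quality = explanation_correctness >= threshold
    if ai_correct:
        # Correct AI advice: good explanations convince, poor ones confuse.
        return "convincing" if high_quality else "confusing"
    # Incorrect AI advice: good-looking explanations mislead,
    # poor ones help reveal the error.
    return "misleading" if high_quality else "revealing"


if __name__ == "__main__":
    for c_ai in (False, True):
        for c_chi in (0.2, 0.9):
            print(c_ai, c_chi, explanation_scenario(c_ai, c_chi))
```

The labels depend only on the pair ($C_{AI}$, $C_\chi$), so the same function applies regardless of whether the explanation is a saliency map, a natural language explanation, or both.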