Unraveling the Dilemma of AI Errors: Exploring the Effectiveness of Human and Machine Explanations for Large Language Models

Marvin Pafla, Kate Larson, Mark Hancock

TL;DR

The paper assesses the efficacy of human versus machine explanations for Large Language Models in a QA setting using SQuAD-based tasks. It collects 156 human explanations and contrasts them with machine explanations from integrated gradients, conservative LRP, and ChatGPT, evaluating them in a large online study (N=136) across correct and incorrect AI outputs. Findings show human saliency explanations are more helpful than machine ones, yet explainability can decrease performance when explanations accompany incorrect AI predictions, revealing an AI-explanation dilemma and confirmation bias risk. The study offers design and research recommendations to improve XAI practice, emphasizing the importance of including incorrect predictions in evaluations and framing explanations as exploratory aids rather than definitive explanations.

Abstract

The field of eXplainable artificial intelligence (XAI) has produced a plethora of methods (e.g., saliency-maps) to gain insight into artificial intelligence (AI) models, and has exploded with the rise of deep learning (DL). However, human-participant studies question the efficacy of these methods, particularly when the AI output is wrong. In this study, we collected and analyzed 156 human-generated text and saliency-based explanations collected in a question-answering task (N=40) and compared them empirically to state-of-the-art XAI explanations (integrated gradients, conservative LRP, and ChatGPT) in a human-participant study (N=136). Our findings show that participants found human saliency maps to be more helpful in explaining AI answers than machine saliency maps, but performance negatively correlated with trust in the AI model and explanations. This finding hints at the dilemma of AI errors in explanation, where helpful explanations can lead to lower task performance when they support wrong AI predictions.

Paper Structure (51 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Screenshots of two question-answering tasks. Participants were shown questions, source texts, answers, and explanations, and asked to evaluate the answers with the help of the explanation and source text. Although both human and machine-generated saliency- and text-based explanations were presented across multiple between-subjects conditions, all participants were told that the answers were AI-generated. The example on the left features an answer from BERT, our language model, with a saliency map produced via conservative-LRP; the example on the right shows a human-provided answer and contrastive explanation.
  • Figure 2: Our study, divided into three parts, assessed the efficacy of various human and machine-generated saliency- and text-based explanations through empirical research. Initially (left section), we gathered human explanations from 40 participants in a crowdsourcing task, in which they created saliency-based explanations by highlighting source text and wrote text explanations. Subsequently, we used conservative-LRP, integrated gradients, and ChatGPT to generate machine explanations. Analysis (middle section) revealed a limited overlap (21%) between human and machine-generated saliency maps, with the latter being more dispersed. Through thematic analysis, we developed a coding scheme to classify human explanations into four categories, which was then independently applied. In the final part (right section), 136 participants evaluated the collected and generated explanations across seven conditions (four saliency-based and three text-based, control conditions included). This evaluation measured both objective (e.g., performance) and subjective (e.g., satisfaction) metrics, exposing participants to both correct and incorrect answers within a between-subjects design.
  • Figure 3: The overlap between AI-generated and human saliency maps varies with the number of AI attributions considered in the analysis. Techniques such as conservative-LRP and integrated gradients assign a score to each word, so including all attributions yields the greatest overlap with human maps. However, the marginal benefit of including attributions beyond the first 15 diminishes, and the overlap plateaus. To minimize explanation clutter and match the human average (approximately 15 words, indicated by the blue line), only the 15 most significant attribution scores were visualized in the empirical evaluation of machine-generated saliency maps.
  • Figure 4: Connection between hypotheses, scales, tests, and results. We hypothesize (top left corner) that, for each of the measures (performance, satisfaction, etc.), there exist three groups of conditions (a, b, and c) such that each condition differs significantly on this measure from all conditions in the other groups. In the top right, we provide a table with basic information about the measures used in this study: the scale of the measure, whether it is a Likert scale, which test we applied to the hypotheses for the measure, and whether we ran a mixed model for the measure that included repeated measurements from participants (i.e., eight question-answering tasks). While we were not able to confirm most of our hypotheses, we present the most interesting findings of the study: there was a significant difference in trust for text-based explanations, and significant differences in quality, helpfulness, and time for saliency-based explanations. Significant effects that partially confirm or contradict our hypotheses are marked with black and grey parentheses, respectively.
  • Figure 5: We present the main results of our study for saliency maps (light blue background) and text-based explanations (red background). For each of our between-subjects conditions, we present the mean and 95% confidence intervals (CIs) for satisfaction, trust, and curiosity in the first row, performance and time in the second row, and quality, helpfulness, and mental effort in the third row. In the second row, we include measures for two extra control conditions, No saliency (Control) and No explanation (Control), in which participants evaluated the same set of AI answers in a question-answering task as in the other conditions, but without any explanation. In the last row, we display the means and CIs for both incorrect (blue) and correct (black) AI answers. As can be seen in the plots, non-overlapping CIs indicate significant effects: between text extraction and ChatGPT for trust, between human and machine saliency maps for helpfulness, and between human saliency maps and integrated gradients for time.
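
The top-k analysis described in the Figure 3 caption can be illustrated with a minimal Python sketch. The captions do not say how overlap was computed, so the definition here (the fraction of human-highlighted words that also appear among the k highest-scoring machine attributions) is an assumption, as is ranking by absolute score. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Set


def top_k_indices(scores: Sequence[float], k: int) -> Set[int]:
    """Return the positions of the k words with the largest |attribution|.

    Assumption: words are ranked by absolute score; the paper may rank differently.
    """
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(ranked[:k])


def overlap_fraction(scores: Sequence[float], human: Set[int], k: int) -> float:
    """Fraction of human-highlighted words covered by the top-k attributions.

    Assumption: overlap is measured relative to the human highlight set.
    """
    if not human:
        return 0.0
    return len(top_k_indices(scores, k) & human) / len(human)


# Toy example: six words, three of them highlighted by a human annotator.
toy_scores = [0.1, -0.9, 0.05, 0.7, 0.3, -0.02]
toy_human = {1, 3, 4}

for k in range(1, len(toy_scores) + 1):
    print(k, round(overlap_fraction(toy_scores, toy_human, k), 2))
```

Under this definition the overlap can only grow or stay flat as k increases, and it reaches its maximum once every word is included, which matches the plateau the caption describes. A definition that also penalizes extra machine-highlighted words (for example, intersection over union) would behave differently.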