How to Measure Human-AI Prediction Accuracy in Explainable AI Systems

Sujay Koujalgi; Andrew Anderson; Iyadunni Adenuga; Shikha Soneji; Rupika Dikkala; Teresita Guzman Nader; Leo Soccio; Sourav Panda; Rupak Kumar Das; Margaret Burnett; Jonathan Dodge

TL;DR

Adopting the paper's operationalization of the prediction task and its analysis methodology will improve the rigor of user studies built on that task, which is particularly important when the domain features a large output space.

Abstract

Assessing an AI system's behavior, particularly in Explainable AI systems, is sometimes done empirically by measuring people's ability to predict the agent's next move. But how should such measurements be performed? In empirical studies with humans, an obvious approach is to frame the task as binary (i.e., a prediction is either right or wrong), but this does not scale. As output spaces grow, so do floor effects, because the ratio of right answers to wrong answers quickly becomes very small. The crux of the problem is that the binary framing fails to capture the different degrees of "wrongness." To address this, we begin by proposing three mathematical bases upon which to measure "partial wrongness." We then use these bases to perform two analyses on sequential decision-making domains: the first is an in-lab study with 86 participants on a size-36 action space; the second is a re-analysis of a prior study on a size-4 action space. Other researchers who adopt our operationalization of the prediction task and our analysis methodology will improve the rigor of user studies conducted with that task, which is particularly important when the domain features a large output space.
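
The floor-effect argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes exactly one action counts as "right" under the binary framing, and the function names are hypothetical. It compares the two action-space sizes the abstract mentions (4 and 36).

```python
from fractions import Fraction


def right_to_wrong_ratio(n_actions: int) -> Fraction:
    """Ratio of right to wrong answers, assuming exactly one right action."""
    if n_actions < 2:
        raise ValueError("need at least two actions for a prediction task")
    return Fraction(1, n_actions - 1)


def chance_accuracy(n_actions: int) -> Fraction:
    """Binary accuracy of a uniform random guess (assumption: one right action)."""
    if n_actions < 1:
        raise ValueError("need at least one action")
    return Fraction(1, n_actions)


# Action-space sizes mentioned in the abstract.
for size in (4, 36):
    print(size, right_to_wrong_ratio(size), chance_accuracy(size))
```

Under these assumptions the right-to-wrong ratio falls from 1/3 to 1/35 when moving from the size-4 to the size-36 action space, which is the kind of shrinkage the binary framing suffers from.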

Paper Structure (45 sections, 6 equations, 10 figures, 5 tables)

Introduction
Background and Related Work
Human-AI Relationships in AI Systems
AI Explanations
Opaque Box Explanations
Transparent Box Explanations
Hybrid Explanations
How XAI Researchers Evaluate AI Explanations
Data Collection Methods
Study 1 - MNK Games Domain
Domain
Agent
Explanation 1: Scores Through-Time (STT)
Explanation 2: Scores On-the-Board (OTB)
Explanation 3: Scores Best-to-Worst (BTW)
...and 30 more sections

Figures (10)

Figure 1: Explanations adapted from Dodge et al. [dodge2022people]. Top: Scores Through-Time (STT); Middle: Scores On-the-Board (OTB); and Bottom: Scores Best-to-Worst (BTW). Insets are NOT part of the interface, but we provide them for greater figure clarity.
Figure 2: An example of how participants saw the dialogue box (right) and made their predictions. Notice that they selected coordinates of their prediction via drop-down menus, to specify a column letter and row number. Once they had formulated a prediction, they provided their justifications in the first text box, with an optional scratch pad below. While they were making their predictions, the "Step" button was not available (left, top), but they could Rewind via the slider (left).
Figure 3: A notional illustration of our three measurement constructs in teal text. Suppose an agent in the Four Towers domain predicts values as the bar chart shows (sorted in decreasing order, as in the BTW explanation shown in Figure 1). The agent will select Q3 (shown in red) for this decision, since it has the max value. Suppose a particular participant predicted Q4 (also in red). Then $LV(Q3, Q4)$ would compute the difference in value space as shown. In rank space, $LR(Q3, Q4) = 2$ as shown because, regardless of the numerical values predicted, this is their order in the action list sorted by value. $mRBO$ takes exactly that rank list and compares it against some reference list. In our work, we combine the predictions of a group of participants in a voting scheme to create a rank order, but the rank order could also come from asking each participant for a (partial) ordering of what they think the agent will prefer.
Figure 4: Illustration of the visual difference between the binary prediction framing and one of the partial credit systems we propose, both applied to Study 1's data from the MNK games domain. Left: Overview of binary prediction correctness for every participant, showing correct predictions in black and incorrect predictions in white. Both floor and ceiling effects are prevalent, which hinders comparative statistics. Right: Overview of distributions of grades $DLR()$ for every participant at every prediction. Unfortunately, floor and ceiling effects are still present (e.g., predictions 2 and 3 have moved from the floor to the ceiling). Only prediction 1 seems well conditioned, and that is where we will find our only statistically significant result.
Figure 5: Every participant's correctness on every prediction, divided by treatment. Top left: Prediction 1. Top right: Prediction 2. Bottom left: Prediction 3. Bottom right: Prediction 4. Note how the floor and ceiling effects we saw in the data overall are even worse when dividing by treatment.
...and 5 more figures
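
The value-space and rank-space constructs described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the action values are hypothetical (the caption does not give the numbers), $LV$ is assumed to be the value of the agent's action minus the value of the predicted action, and $LR$ is assumed to be the offset between the two actions' positions in the list sorted by decreasing value. The function names are hypothetical, and $mRBO$ is not covered here.

```python
def rank_actions(values):
    """Return action names sorted by decreasing value (as in a BTW-style list)."""
    return sorted(values, key=lambda action: values[action], reverse=True)


def loss_in_value(values, agent_action, predicted_action):
    """Assumed form of LV: value of the agent's action minus the predicted one."""
    return values[agent_action] - values[predicted_action]


def loss_in_rank(values, agent_action, predicted_action):
    """Assumed form of LR: offset between the two actions in the sorted list."""
    ranking = rank_actions(values)
    return ranking.index(predicted_action) - ranking.index(agent_action)


# Hypothetical values for the four actions; chosen so that Q3 is the max
# and Q4 sits two positions below it, matching the caption's LR(Q3, Q4) = 2.
values = {"Q1": 0.6, "Q2": 0.1, "Q3": 0.9, "Q4": 0.4}

agent_action = rank_actions(values)[0]  # the agent picks the max-value action
print(agent_action)
print(loss_in_value(values, agent_action, "Q4"))
print(loss_in_rank(values, agent_action, "Q4"))
```

With these made-up numbers, a prediction of Q4 is wrong under the binary framing, but the two losses distinguish it from a prediction of Q2, which would be further from the agent's choice in both value and rank.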
