In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making

Raymond Fok; Daniel S. Weld

TL;DR

The paper tackles the inconsistent empirical findings on AI explanations in decision making by proposing a verifiability-centered theory: explanations improve team performance only when they enable verification of the AI's recommendation, with verification analogous to certificate checking in $NP$-complete problems. It reframes appropriate reliance as strategy-graded reliance, arguing that decisions should depend on expected relative performance rather than post-hoc outcomes. The discussion covers verification spectra, cognitive and interaction paradigms, and broader uses of explanations beyond immediate task performance, including collective decision making and regulatory contexts. Practically, the work guides XAI design toward verifiable, interaction-capable explanations that enable humans to detect AI errors and collaborate effectively with AI in real-world settings.

Abstract

The current literature on AI-advised decision making, in which explainable AI systems advise human decision makers, presents a series of inconclusive and confounding results. To synthesize these findings, we propose a simple theory that elucidates the frequent failure of AI explanations to engender appropriate reliance and complementary decision making performance. We argue explanations are only useful to the extent that they allow a human decision maker to verify the correctness of an AI's prediction, in contrast to other desiderata, e.g., interpretability or spelling out the AI's reasoning process. Prior studies find that, in many decision making contexts, AI explanations do not facilitate such verification. Moreover, most tasks fundamentally do not allow easy verification, regardless of explanation method, limiting the potential benefit of any type of explanation. We also compare the objective of complementary performance with that of appropriate reliance, decomposing the latter into the notions of outcome-graded and strategy-graded reliance.

Paper Structure (16 sections, 6 figures, 1 table)

This paper contains 16 sections, 6 figures, 1 table.

Introduction
Background
AI-Advised Decision Making
Verifiability
Useful Explanations Allow Verification
A Spectrum of Verification
Verification and Human Intuitions
What is Appropriate Reliance?
Discussion
Faithfulness vs. Verifiability
Paradigms of XAI-Advised Decision Making
Beyond Complementary Performance
Verifiability in Collective Decision Making
Unequal distribution of responsibility and power
Rich interaction modalities benefit collective intelligence
...and 1 more sections

Figures (6)

Figure 1: Researchers suggest that AI explanations could aid numerous human-AI processes, including decision making, model development, knowledge discovery, and model audit. In this paper, we focus solely on understanding whether explanations are helpful in the context of AI-advised decision making. We claim AI explanations cannot foster appropriate reliance and engender complementary performance in decision making, except in the rare instances in which they efficiently verify the AI's recommendation.
Figure 2: Examples of AI-advised decision making tasks. The top two rows present prototypical tasks found in the XAI literature: deceptive review detection, with textual features, and recidivism prediction, with tabular features. Both tasks often utilize feature importance explanations, which studies suggest rarely engender complementary performance or foster appropriate reliance. The bottom two rows show examples of a maze solving and open-domain question answering task, for which studies have found AI explanations can engender complementary performance.
Figure 3: In decision making contexts where the AI outperforms the human, improved team performance can often be achieved by providing an AI explanation, as explanations tend to cause humans to trust the AI regardless of its correctness. Unfortunately, when the AI errs more than the human, these explanations can reduce performance due to over-reliance. However, explanations that support quick verification may still produce complementary performance by allowing the human to recognize situations where they have made a mistake, even if the AI is less accurate on average than the human.
Figure 4: Tasks in which AI explanations have been shown to provide verification of an AI recommendation (e.g., maze solving) resemble NP-complete decision problems. Complementary performance can be achieved in these cases because the AI does the time-consuming problem solving and the human can quickly verify the correctness of a solution. Of course, this verification process is task specific. Sometimes fast verification is only possible for some answer types; for example, an illegal maze path does not preclude the existence of a legal path. On average, however, these explanations speed human verification (Vasconcelos et al., 2023).
Figure 5: The utility of an explanation is not a simple function of explanation type, as these two examples illustrate. Both use an explanation style that highlights key spans of text, but their ability to induce complementary performance is very different. While these explanations can be useful for factoid QA from trusted sources, they are less successful for sentiment analysis (Bansal et al., 2021). The difference lies in whether the human can reliably verify the AI's answer by solely inspecting the highlighted text.
...and 1 more figures
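
The maze example in the Figure 4 caption can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the grid encoding (0 for open, 1 for wall), the function name `verify_maze_path`, and the sample maze are all assumptions. It shows why checking a proposed path is cheap (one pass over the path) even when finding a path requires search, and why a rejected path says nothing about whether the maze is solvable.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, column); hypothetical encoding for this sketch


def verify_maze_path(grid: List[List[int]], path: List[Cell],
                     start: Cell, goal: Cell) -> bool:
    """Return True if `path` is a legal walk from `start` to `goal`.

    Assumed encoding: grid[r][c] == 0 is an open cell, 1 is a wall.
    The check runs in time linear in the length of the path.
    """
    # The path must begin at the start and end at the goal.
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    # Every cell must lie inside the maze and be open.
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == 1:
            return False
    # Consecutive cells must be horizontal or vertical neighbours.
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False
    return True


# Hypothetical 3x3 maze used only for illustration.
GRID = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
]
START, GOAL = (0, 0), (2, 2)

legal_path = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
illegal_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]  # crosses walls

print(verify_maze_path(GRID, legal_path, START, GOAL))
print(verify_maze_path(GRID, illegal_path, START, GOAL))
```

In this sketch the second path is rejected, yet the maze is still solvable via the first path. This mirrors the caption's caveat that fast verification may only be available for some answer types: a valid path certifies that the maze is solvable, while an invalid one certifies nothing.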
