
Auditing for Human Expertise

Rohan Alur, Loren Laine, Darrick K. Li, Manish Raghavan, Devavrat Shah, Dennis Shung

TL;DR

We propose a simple procedure which tests whether expert predictions are statistically independent from the outcomes of interest after conditioning on the available inputs. A rejection of this test suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI `complementarity' is achievable in a given prediction task.

Abstract

High-stakes prediction tasks (e.g., patient diagnosis) are often handled by trained human experts. A common source of concern about automation in these settings is that experts may exercise intuition that is difficult to model and/or have access to information (e.g., conversations with a patient) that is simply unavailable to a would-be algorithm. This raises a natural question: do human experts add value that could not be captured by an algorithmic predictor? We develop a statistical framework under which we can pose this question as a natural hypothesis test. Indeed, as our framework highlights, detecting human expertise is more subtle than simply comparing the accuracy of expert predictions to those made by a particular learning algorithm. Instead, we propose a simple procedure which tests whether expert predictions are statistically independent from the outcomes of interest after conditioning on the available inputs (`features'). A rejection of our test thus suggests that human experts may add value to any algorithm trained on the available data, and has direct implications for whether human-AI `complementarity' is achievable in a given prediction task. We highlight the utility of our procedure using admissions data collected from the emergency department of a large academic hospital system, where we show that physicians' admit/discharge decisions for patients with acute gastrointestinal bleeding (AGIB) appear to incorporate information that is not available to a standard algorithmic screening tool. This is despite the fact that the screening tool is arguably more accurate than physicians' discretionary decisions, highlighting that -- even absent normative concerns about accountability or interpretability -- accuracy is insufficient to justify algorithmic automation.
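
To make the idea of testing conditional independence concrete, the following is a minimal sketch of one way such a test could be run. It is not the authors' implementation. It assumes the null hypothesis is that expert predictions $\hat{Y}$ are independent of outcomes $Y$ given features $X$, that $L$ counts pairs of observations with nearby features, and that $K$ counts random resamples in which predictions are swapped within pairs. The function names, the greedy pairing rule, the squared-error loss, the $(1 + \text{count})/(K + 1)$ convention for $\tau$, and the toy data are all illustrative assumptions.

```python
import random
from itertools import combinations


def greedy_pairs(X, L):
    """Greedily pick up to L disjoint index pairs with close feature vectors."""
    def sq_dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(X[i], X[j]))

    # Sort all candidate pairs by distance, then take the closest unused ones.
    candidates = sorted(combinations(range(len(X)), 2), key=lambda p: sq_dist(*p))
    used, pairs = set(), []
    for i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
            if len(pairs) == L:
                break
    return pairs


def paired_loss(y, y_hat, pairs, swap):
    """Mean squared error over paired points; swap[k] exchanges predictions in pair k."""
    total = 0.0
    for (i, j), s in zip(pairs, swap):
        pred_i, pred_j = (y_hat[j], y_hat[i]) if s else (y_hat[i], y_hat[j])
        total += (y[i] - pred_i) ** 2 + (y[j] - pred_j) ** 2
    return total / (2 * len(pairs))


def expert_test_sketch(X, y, y_hat, L, K, rng):
    """Return tau: the share of resamples whose loss is no worse than the observed loss."""
    pairs = greedy_pairs(X, L)
    observed = paired_loss(y, y_hat, pairs, [False] * len(pairs))
    as_good = 0
    for _ in range(K):
        swap = [rng.random() < 0.5 for _ in pairs]
        if paired_loss(y, y_hat, pairs, swap) <= observed:
            as_good += 1
    return (1 + as_good) / (K + 1)  # assumed permutation-test convention


# Hypothetical toy data: Y depends on an observed feature X and an unobserved U.
rng = random.Random(0)
n = 200
X = [(float(rng.randrange(4)),) for _ in range(n)]
U = [rng.gauss(0, 1) for _ in range(n)]
y = [x[0] + u + 0.1 * rng.gauss(0, 1) for x, u in zip(X, U)]
informed = [x[0] + u for x, u in zip(X, U)]        # expert who sees U
uninformed = [x[0] + rng.gauss(0, 1) for x in X]   # expert who sees only X

tau_informed = expert_test_sketch(X, y, informed, L=50, K=200, rng=rng)
tau_uninformed = expert_test_sketch(X, y, uninformed, L=50, K=200, rng=rng)
print(tau_informed, tau_uninformed)
```

In this sketch the test rejects when $\tau \leq \alpha$. The informed expert's predictions stop tracking outcomes once they are swapped between patients with the same features, so its $\tau$ should be small. For the uninformed expert, swapping changes nothing in distribution, so $\tau$ carries no systematic signal. Using discrete features keeps the pairs exactly matched; with real-valued features the pairs are only approximately matched, which is presumably why the paper's validity result involves an error term.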

Paper Structure

This paper contains 16 sections, 11 theorems, 60 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Given $\alpha \in (0, 1)$ and parameters $K \geq 1, L \geq 1$, the Type I error of ExpertTest satisfies [...], where $\varepsilon^*_{n, L}$ is defined as follows: [...]

Figures (6)

  • Figure 1: Distribution of Euclidean distances between each pair of patients chosen by ExpertTest when patients are represented as a vector of nine patient characteristics, of which four -- blood urea nitrogen (BUN), hemoglobin (HGB), systolic blood pressure (SBP) and pulse -- are real-valued. $L$ indicates the number of pairs of patients chosen for each experiment, with the boxplot indicating the distribution of pairwise Euclidean distances between them. The red line at $\sqrt{9} = 3$ indicates the maximum possible Euclidean distance in this feature space.
  • Figure 2: The distribution of $\tau$ is sharply nonuniform when the expert incorporates unobserved information $U$ in the toy example. The vertical red line indicates a critical threshold of $\alpha = .05$, and the dashed line traces a uniform distribution.
  • Figure 3: The distribution of $\tau$ is approximately uniform when the expert does not incorporate unobserved information in the toy example. The vertical red line indicates a critical threshold of $\alpha = .05$, and the dashed line traces a uniform distribution.
  • Figure 4: The power of ExpertTest as a function of sample size $n$ and expertise parameter $\delta$. The horizontal dashed line corresponds to a power of $80\%$.
  • Figure 5: The power of ExpertTest as a function of $L$, with $n=600, \delta = .2$. The horizontal dashed line corresponds to a power of $80\%$.
  • ...and 1 more figure

Theorems & Definitions (11)

  • Theorem 1: Validity of ExpertTest
  • Theorem 2: Asymptotic Validity
  • Lemma 3: Bounding the total variation distance between i.i.d. coin flips
  • Corollary 3.1: Weaker type I error bound
  • Lemma 4: Existence of an optimal matching
  • Lemma 5: Greedy approximation to the optimal matching
  • Corollary 5.1
  • Corollary 5.2
  • Lemma 6: Pairwise distance in terms of packing number
  • Corollary 6.1
  • ...and 1 more