Human Expertise in Algorithmic Prediction

Rohan Alur; Manish Raghavan; Devavrat Shah

TL;DR

Although algorithms often outperform their human counterparts on average, human judgment can improve algorithmic predictions on specific instances. The paper's framework, built on algorithmic indistinguishability, provides a natural way of uncovering this heterogeneity and thus enabling effective human-AI collaboration.

Abstract

We introduce a novel framework for incorporating human expertise into algorithmic predictions. Our approach leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or "look the same" to predictive algorithms. We argue that this framing clarifies the problem of human-AI collaboration in prediction tasks, as experts often form judgments by drawing on information which is not encoded in an algorithm's training data. Algorithmic indistinguishability yields a natural test for assessing whether experts incorporate this kind of "side information", and further provides a simple but principled method for selectively incorporating human feedback into algorithmic predictions. We show that this method provably improves the performance of any feasible algorithmic predictor and precisely quantify this improvement. We find empirically that although algorithms often outperform their human counterparts on average, human judgment can improve algorithmic predictions on specific instances (which can be identified ex-ante). In an X-ray classification task, we find that this subset constitutes nearly $30\%$ of the patient population. Our approach provides a natural way of uncovering this heterogeneity and thus enabling effective human-AI collaboration.
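The test described in the abstract can be illustrated with a small sketch. This is a simplified stand-in, not the paper's own procedure: it assumes we already have one subset of inputs that the algorithm cannot tell apart, and it uses an ordinary permutation test on the covariance between the expert's predictions and the outcomes within that subset. The function name `side_information_pvalue` and its parameters are hypothetical.

```python
import numpy as np

def side_information_pvalue(human_pred, y, n_perm=2000, seed=0):
    """Permutation p-value for |Cov(human_pred, y)| within one subset.

    Assumption: all rows belong to a single algorithmically
    indistinguishable subset, so shuffling the outcomes is a reasonable
    model of an expert who has no information beyond the algorithm's.
    """
    rng = np.random.default_rng(seed)
    human_pred = np.asarray(human_pred, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.cov(human_pred, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(y)
        if abs(np.cov(human_pred, permuted)[0, 1]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

A small p-value suggests the expert's predictions track the outcome even where the algorithm sees no difference between inputs, which is the kind of "side information" the abstract refers to.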

Paper Structure (26 sections, 19 theorems, 76 equations, 14 figures, 1 algorithm)

Introduction
Related work
Methodology and preliminaries
Technical results
Experiments
Chest X-ray interpretation
Prediction of success in human collaboration
Robustness to noncompliance
Discussion and limitations
Additional technical results
A nonlinear analog of \ref{thm: optimality of linear regression}
A finite sample analog of \ref{cor: high dimensional feedback}
Extending \ref{lemma: finite sample single subset} to a partition of $\mathcal{X}$
The impossibility of arbitrary deferral policies
Learning multicalibrated partitions
...and 11 more sections

Key Result

Theorem 4.1

Let $\{S_k\}_{k \in [K]}$ be an $\alpha$-multicalibrated partition with respect to a model class $\mathcal{F}$ and target $Y$. Let the random variable $J(X) \in [K]$ be such that $J(X) = k$ iff $X \in S_k$. Define $\gamma^*, \beta^* \in \mathbb{R}^K$ as [...]. Then, for any $f \in \mathcal{F}$ and $k \in [K]$, [...].
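The expression defining $\gamma^*$ and $\beta^*$ is missing from this copy of the theorem. As a minimal sketch, assuming they are the intercepts and slopes of a separate least-squares regression of $Y$ on the human prediction within each subset $S_k$ (an assumption suggested by the section title referring to the "optimality of linear regression"), the per-subset fit might look like the following. The function names and the fallback for constant human predictions are hypothetical.

```python
import numpy as np

def fit_within_subsets(subset_ids, human_pred, y):
    """Fit y ~ gamma_k + beta_k * human_pred separately in each subset k."""
    coefs = {}
    for k in np.unique(subset_ids):
        mask = subset_ids == k
        yk, hk = y[mask], human_pred[mask]
        if hk.var() == 0.0:
            # Constant human predictions carry no signal here;
            # fall back to the subset mean of y.
            coefs[int(k)] = (float(yk.mean()), 0.0)
        else:
            beta = np.cov(hk, yk, bias=True)[0, 1] / hk.var()
            coefs[int(k)] = (float(yk.mean() - beta * hk.mean()), float(beta))
    return coefs

def predict(coefs, subset_ids, human_pred):
    """Combine subset membership J(X) with the human prediction."""
    gamma = np.array([coefs[int(k)][0] for k in subset_ids])
    beta = np.array([coefs[int(k)][1] for k in subset_ids])
    return gamma + beta * human_pred
```

In a subset where the fitted slope is close to zero, the combined prediction reduces to the subset mean, so human input is effectively ignored there and used only where it carries signal.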

Figures (14)

Figure 1: Partitions which are approximately multicalibrated with respect to the class of hyperplane classifiers (we consider the empirical distribution placing equal probability on each observation). In both panels, no hyperplane classifier has significant discriminatory power within each subset.
Figure 2: The relative performance of radiologists and predictive algorithms for detecting atelectasis. Each bar plots the Matthews Correlation Coefficient between the corresponding prediction and the ground truth label. Point estimates are reported with $95\%$ bootstrap confidence intervals.
Figure 3: Conditional performance for atelectasis. Within subset $0$ ($n = 148$), all algorithms predict $Y = 1$, thus achieving true positive rate (TPR) $1$, true negative rate (TNR) $0$, and an MCC of $0$. Radiologists achieve a corresponding (TPR, TNR) of $(84.0\%, 42.9\%)$, $(72.6\%, 47.6\%)$ and $(93.4\%, 19.0\%)$, respectively. Subset $1$ ($n = 352$) contains the remaining patients. The baseline is a random permutation of the labels. Confidence intervals for algorithmic performance are not strictly valid (subsets are chosen conditional on the predictions), but are included for reference. All else is as in \ref{fig: radiologist v algo accuracy}.
Figure 4: Human performance within the approximate level sets of a predictor $h$ which is multicalibrated over $\mathcal{F}^{\text{RT}5}$. Level sets $0, 1,$ and $10$ are the sets $\{x \mid h(x) = 0\}$, $\{x \mid h(x) \in (0, .1]\}$, and $\{x \mid h(x) \in [.9, 1]\}$, and contain $259, 309$ and $292$ observations, respectively. All other level sets are empty in our test set. A random permutation of the labels is included as a baseline.
Figure 5: The relative performance of radiologists and predictive algorithms for detecting a pleural effusion. Each bar plots the Matthews Correlation Coefficient between the corresponding prediction and the ground truth label. Point estimates are reported with $95\%$ bootstrap confidence intervals.
...and 9 more figures
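The figure captions report the Matthews Correlation Coefficient (MCC) with $95\%$ bootstrap confidence intervals. A minimal sketch of that metric follows, assuming binary labels and a percentile bootstrap; the captions do not say which bootstrap variant was used, so that choice and the function names are assumptions. The sketch returns $0$ when the denominator vanishes, which matches the caption's remark that a constant prediction of $Y = 1$ yields an MCC of $0$.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = float(np.sum(y_true & y_pred))
    tn = float(np.sum(~y_true & ~y_pred))
    fp = float(np.sum(~y_true & y_pred))
    fn = float(np.sum(y_true & ~y_pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def bootstrap_ci(y_true, y_pred, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the MCC (resampling rows)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(mcc(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)
```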

Theorems & Definitions (38)

Definition 3.1: $\alpha$-Indistinguishable subset
Definition 3.2: $\alpha$-Multicalibrated partition
Theorem 4.1
Corollary 4.2
Theorem 4.3
Theorem 6.1
Corollary A.1
Lemma A.2
Corollary A.3
Lemma A.4
...and 28 more
