Integrating Expert Judgment and Algorithmic Decision Making: An Indistinguishability Framework

Rohan Alur, Loren Laine, Darrick K. Li, Dennis Shung, Manish Raghavan, Devavrat Shah

TL;DR

A novel framework for human-AI collaboration in prediction and decision tasks that leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or "look the same" to any feasible predictive algorithm.

Abstract

We introduce a novel framework for human-AI collaboration in prediction and decision tasks. Our approach leverages human judgment to distinguish inputs which are algorithmically indistinguishable, or "look the same" to any feasible predictive algorithm. We argue that this framing clarifies the problem of human-AI collaboration in prediction and decision tasks, as experts often form judgments by drawing on information which is not encoded in an algorithm's training data. Algorithmic indistinguishability yields a natural test for assessing whether experts incorporate this kind of "side information", and further provides a simple but principled method for selectively incorporating human feedback into algorithmic predictions. We show that this method provably improves the performance of any feasible algorithmic predictor and precisely quantify this improvement. We demonstrate the utility of our framework in a case study of emergency room triage decisions, where we find that although algorithmic risk scores are highly competitive with physicians, there is strong evidence that physician judgments provide signal which could not be replicated by any predictive algorithm. This insight yields a range of natural decision rules which leverage the complementary strengths of human experts and predictive algorithms.

Paper Structure

This paper contains 35 sections, 16 theorems, 64 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.1

Let $\{S_k\}_{k \in [K]}$ be an $\alpha$-multicalibrated partition with respect to a model class $\mathcal{F}$ and target $Y$. Let the random variable $J(X) \in [K]$ be such that $J(X) = k$ iff $X \in S_k$. Define $\gamma^*, \beta^* \in \mathbb{R}^K$ as ... Then, for any $f \in \mathcal{F}$ and $k \in [K]$, ...

Figures (5)

  • Figure 1: The correlation between the physician's decision (hospitalize or discharge) and adverse outcomes within the $669$ observationally indistinguishable groups of patients. Large points are statistically significant at the $5\%$ level (estimated from $1000$ bootstrap replicates). The right panel applies a Bonferroni correction to account for multiple testing (i.e., large points are significantly different from $0$ at the $.05/669$ level).
  • Figure 2: The correlation between the physician's decision (hospitalize or discharge) and adverse outcomes within the level sets of the Glasgow-Blatchford Score. Point estimates are reported with $95\%$ Bonferroni-corrected confidence intervals (estimated from $1000$ bootstrap replicates).
  • Figure 3: Partitions which are approximately multicalibrated with respect to the class of hyperplane classifiers. For illustration purposes, we consider the empirical distribution placing equal probability on each observation. In both panels, no hyperplane classifier has significant discriminatory power within each subset.
  • Figure 4: Physician performance within the level sets of a predictor which is multicalibrated with respect to $\mathcal{F}^{\text{RT}3}$. Point estimates are reported with $95\%$ Bonferroni-corrected bootstrap confidence intervals (estimated from $1000$ bootstrap replicates).
  • Figure 5: The performance of policies which independently choose whether to hospitalize, discharge, or defer to the physician within each indistinguishable subset. The red policy defers to the physician within all but one subset; it achieves a true positive rate of $99.7\%$, a false positive rate of $77.0\%$, and automates $7.6\%$ of decisions. The blue policy only defers to the physician within one subset, achieving a true positive rate of $97.4\%$ and a false positive rate of $53.6\%$ while automating $86\%$ of decisions.
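
Several of the captions above report a within-group correlation between the physician's decision and the adverse outcome, with bootstrap intervals and a Bonferroni correction. The sketch below illustrates one way such a quantity could be computed; it is not the authors' code. The function names (`within_group_correlations`, `_corr`), the use of a percentile bootstrap interval, and the input layout (one group label, one binary decision, and one binary outcome per patient) are all assumptions made for illustration.

```python
import numpy as np


def _corr(a, b):
    """Pearson correlation; NaN when either input is constant."""
    if a.std() == 0 or b.std() == 0:
        return np.nan
    return float(np.corrcoef(a, b)[0, 1])


def within_group_correlations(groups, decision, outcome,
                              n_boot=1000, alpha=0.05, seed=0):
    """Correlation between decision and outcome inside each group.

    Returns a dict mapping each group label to its point estimate, a
    percentile bootstrap interval, and a significance flag. The interval
    uses the Bonferroni-corrected level alpha / (number of groups).
    The percentile interval is an assumption of this sketch.
    """
    groups = np.asarray(groups)
    decision = np.asarray(decision, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    rng = np.random.default_rng(seed)

    labels = np.unique(groups)
    level = alpha / len(labels)  # Bonferroni correction across groups

    results = {}
    for g in labels:
        idx = np.flatnonzero(groups == g)
        d, y = decision[idx], outcome[idx]

        # Resample patients within the group, with replacement.
        boots = np.empty(n_boot)
        for b in range(n_boot):
            s = rng.integers(0, len(idx), size=len(idx))
            boots[b] = _corr(d[s], y[s])

        # Replicates where the correlation is undefined are ignored.
        lo, hi = np.nanquantile(boots, [level / 2, 1 - level / 2])
        results[g.item()] = {
            "corr": _corr(d, y),
            "ci": (float(lo), float(hi)),
            # Significant when the corrected interval excludes zero.
            "significant": bool(lo > 0 or hi < 0),
        }
    return results
```

With many groups the corrected level becomes very small (for example $0.05/669$), so the percentile interval in this sketch effectively spans the full range of the bootstrap replicates unless `n_boot` is increased accordingly.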

Theorems & Definitions (32)

  • Definition 3.1: $\alpha$-Indistinguishable subset
  • Definition 3.2: $\alpha$-Multicalibrated partition
  • Theorem 4.1
  • Corollary 4.2
  • Theorem 4.3
  • proof
  • Lemma A.1
  • proof
  • proof
  • Lemma C.1
  • ...and 22 more