Statistical Tests for Replacing Human Decision Makers with Algorithms

Kai Feng, Han Hong, Ke Tang, Jingyuan Wang

TL;DR

This paper addresses the problem of when to replace human decision makers with algorithmic predictions in high-stakes binary classification tasks. It introduces a formal framework that separates information processing from incentives, and develops both a heuristic frequentist and a Bayesian posterior-loss approach to identify underperforming doctors relative to a machine ROC curve using an NFPC abnormal birth-detection dataset. Empirically, replacing a substantial subset of doctors with machine recommendations improves overall performance, with the Bayesian method showing particularly strong gains (e.g., up to ~46% increases in TPR and notable reductions in FPR) and revealing geographic patterns in replacement propensity. The study highlights the practicality of clinician–algorithm substitution, demonstrates the value of uncertainty-aware decision rules, and discusses implications for phased adoption and policy design in health economics and beyond.

Abstract

This paper proposes a statistical framework for using artificial intelligence to improve human decision making. The performance of each human decision maker is benchmarked against that of machine predictions. We replace the diagnoses made by a subset of the decision makers with the recommendations of the machine learning algorithm. We apply both a heuristic frequentist approach and a Bayesian posterior loss function approach to abnormal birth detection, using a nationwide dataset of doctor diagnoses from prepregnancy checkups of reproductive-age couples and the corresponding pregnancy outcomes. We find that, on a test dataset, our algorithm achieves a higher overall true positive rate and a lower false positive rate than diagnoses made by doctors alone.
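The benchmarking idea in the abstract, comparing each doctor's FPR/TPR pair with the machine ROC curve, can be illustrated with a minimal sketch. This is not the authors' procedure: it is a plain point comparison that ignores sampling uncertainty, which the paper's frequentist and Bayesian approaches are designed to account for. The function names (`fpr_tpr`, `roc_points`, `machine_tpr_at`, `below_machine_roc`) and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def fpr_tpr(y_true, y_pred):
    """Return the (FPR, TPR) pair of binary decisions against binary outcomes.

    Assumes both classes are present; otherwise the mean of an empty slice is NaN.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    fpr = np.mean(y_pred[~y_true])  # share of true negatives flagged positive
    tpr = np.mean(y_pred[y_true])   # share of true positives flagged positive
    return float(fpr), float(tpr)


def roc_points(y_true, scores):
    """Empirical machine ROC: (FPR, TPR) at every distinct score threshold."""
    scores = np.asarray(scores, dtype=float)
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing flagged
    for t in np.unique(scores)[::-1]:  # thresholds in descending order
        f, p = fpr_tpr(y_true, scores >= t)
        fpr.append(f)
        tpr.append(p)
    return np.array(fpr), np.array(tpr)


def machine_tpr_at(fpr_target, roc_fpr, roc_tpr):
    """Best machine TPR among thresholds whose FPR does not exceed fpr_target."""
    return float(roc_tpr[roc_fpr <= fpr_target].max())


def below_machine_roc(doctor_fpr, doctor_tpr, roc_fpr, roc_tpr):
    """True if some machine threshold has FPR <= the doctor's and a higher TPR."""
    return doctor_tpr < machine_tpr_at(doctor_fpr, roc_fpr, roc_tpr)


# Toy data (illustrative only): outcomes, machine scores, one doctor's diagnoses.
y = [0, 0, 0, 0, 1, 1, 1, 1]
machine_scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
doctor_diagnoses = [0, 1, 1, 0, 1, 1, 0, 0]

roc_fpr, roc_tpr = roc_points(y, machine_scores)
doc_fpr, doc_tpr = fpr_tpr(y, doctor_diagnoses)
replace_doctor = below_machine_roc(doc_fpr, doc_tpr, roc_fpr, roc_tpr)
```

In this sketch a doctor is flagged whenever the point estimate lies below the machine curve. With few cases per doctor, that comparison would be noisy, which is the motivation for treating the decision as a statistical test rather than a raw comparison.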

Paper Structure

This paper contains 28 sections, 3 theorems, 77 equations, 22 figures, and 4 tables.

Key Result

Lemma 2.1

For an i.i.d. sample $\left\{Y_{i}, \widehat{Y}_{i}\right\}_{i=1}^{n}$, the joint asymptotic distribution of a human FPR/TPR pair $\hat{\theta}_{H} = (\hat{\alpha}_{H}, \hat{\beta}_{H})$ is multivariate normal. In particular

Figures (22)

  • Figure 1: Individual and aggregate FPR/TPR pairs
  • Figure 2: Population human FPR/TPR pair and the machine ROC curve
  • Figure 3: Three cases of the heuristic approach
  • Figure 4: Empirical results of combining doctors' and machine decisions using the heuristic frequentist approach (doctors' diagnoses >= 300).
  • Figure 5: Scatter plot of replaced and retained doctors in the test dataset using the frequentist approach (doctors' diagnoses >= 300)
  • ...and 17 more figures

Theorems & Definitions (4)

  • Lemma 2.1
  • Remark 1
  • Lemma A.1
  • Lemma A.2