Does AI help humans make better decisions? A statistical evaluation framework for experimental and observational studies

Eli Ben-Michael, D. James Greiner, Melody Huang, Kosuke Imai, Zhichao Jiang, Sooahn Shin

TL;DR

The paper introduces a principled framework to evaluate whether AI helps humans make better decisions by treating decision quality as a classification problem on baseline potential outcomes under a single-blinded design with randomized or unconfounded treatment assignment. It develops point-identification results for human decisions with and without AI and partial-identification bounds for AI-alone performance, complemented by estimation via AIPW and risk-based hypothesis testing. The authors apply the framework to a randomized trial of the PSA risk instrument in Dane County bail hearings, finding no improvement from PSA recommendations and that PSA-alone can be worse than human judgment, with Llama3 performing even worse as an AI-alone system. Policy-learning extensions give practical rules for when to provide AI input and when to follow AI, but the empirical gains are modest, underscoring the need for careful, context-specific evaluation of AI-assisted decision-making. The framework generalizes to other domains and transfers to observational data under unconfoundedness, with potential extensions to non-binary decisions and joint potential outcomes.

Abstract

The use of Artificial Intelligence (AI), or more generally data-driven algorithms, has become ubiquitous in today's society. Yet, in many cases and especially when stakes are high, humans still make final decisions. The critical question, therefore, is whether AI helps humans make better decisions compared to a human-alone or AI-alone system. We introduce a new methodological framework to empirically answer this question with a minimal set of assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded and unconfounded treatment assignment, where the provision of AI-generated recommendations is assumed to be randomized across cases with humans making final decisions. Under this study design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. Importantly, the AI-alone system includes any individualized treatment assignment, including those that are not used in the original study. We also show when AI recommendations should be provided to a human decision maker, and when one should follow such recommendations. We apply the proposed methodology to our own randomized controlled trial evaluating a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Furthermore, we find that replacing a human judge with algorithms--the risk assessment score and a large language model in particular--leads to worse classification performance.

Paper Structure

This paper contains 44 sections, 12 theorems, 100 equations, 10 figures, 3 tables.

Key Result

Theorem 1

(Identification of the difference in classification risk between human-alone and human-with-AI systems) Under the single-blinded treatment assumption, we can identify the difference in risk between human decisions with ($Z = 1$) and without ($Z = 0$) an AI recommendation, where $R_{\textsc{human}}(\ell_{01}) := R(\ell_{01}; D(0))$ and $R_{\textsc{human+AI}}(\ell_{01}) := R(\ell_{01}; D(1))$ as defined in E...
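
As a rough illustration of the kind of comparison this theorem describes, the sketch below computes a simple difference-in-proportions estimate of the risk difference between the two arms of a randomized experiment. It is not the paper's estimator (the summary mentions AIPW) and it makes several assumptions that do not come from the text: the function name `risk_difference` is hypothetical; `d = 1` codes the restrictive decision; the observed outcome `y` equals the baseline potential outcome whenever `d = 0`; and risk is taken to be the false negative proportion plus $\ell_{01}$ times the false positive proportion. Under randomization the share of cases with a negative baseline outcome is the same in both arms, so the change in the false positive proportion equals minus the change in the true negative proportion, which is observable.

```python
import numpy as np

def risk_difference(z, d, y, l01=1.0):
    """Plug-in estimate of R_human+AI(l01) - R_human(l01).

    Assumed conventions (illustrative, not taken from the source text):
      z   : 1 if the AI recommendation was shown, 0 otherwise (randomized)
      d   : 1 for the restrictive decision, 0 otherwise
      y   : observed outcome; equals the baseline outcome whenever d == 0
      l01 : loss of a false positive relative to a false negative
    """
    z, d, y = (np.asarray(a) for a in (z, d, y))

    fn = (d == 0) & (y == 1)  # false negative: lenient decision, bad outcome
    tn = (d == 0) & (y == 0)  # true negative: lenient decision, good outcome

    # Difference in proportions between the AI arm and the no-AI arm.
    diff_fn = fn[z == 1].mean() - fn[z == 0].mean()
    diff_tn = tn[z == 1].mean() - tn[z == 0].mean()

    # Randomization equalizes Pr(baseline outcome = 0) across arms, so the
    # change in false positives is minus the change in true negatives.
    diff_fp = -diff_tn

    return {"fn": diff_fn, "fp": diff_fp, "risk": diff_fn + l01 * diff_fp}


# Toy data, fabricated purely to exercise the function.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [0, 0, 1, 1, 0, 0, 1, 1]
y = [1, 0, 0, 0, 0, 0, 0, 0]
result = risk_difference(z, d, y, l01=2.0)
```

A negative `risk` value would indicate lower estimated risk when the recommendation is shown. A real analysis would also need standard errors and, in an observational setting, covariate adjustment.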

Figures (10)

  • Figure 1: Estimated Impact of PSA Recommendations on Human Decisions. The figure shows how PSA recommendations change a human judge's cash bail decisions in terms of misclassification rate, false negative proportion, and false positive proportion. Each panel presents the overall and subgroup-specific results for a different outcome variable. For each quantity of interest, we report a point estimate and its corresponding 95% confidence interval for the overall sample (red circle), non-white and white subgroups (blue triangle), and female and male subgroups (green square). The results show that the PSA recommendations do not significantly improve the judge's decisions.
  • Figure 2: Estimated Bounds on Difference in Classification Ability between PSA-alone and Human-alone Decisions. The figure shows the misclassification rate, false negative proportion, and false positive proportion. Each panel presents the overall and subgroup-specific results for a different outcome variable. For each quantity of interest, we report estimated bounds (thick lines) and their corresponding 95% confidence interval (thin lines) for the overall sample (red), non-white and white subgroups (blue), and female and male subgroups (green). The results indicate that PSA-alone decisions are less accurate than the human judge's decisions in terms of the false positive proportion.
  • Figure 3: Estimated Preference for Human-alone Decisions over PSA-alone Decision-Making System. The figure illustrates the range of the ratio of the loss between false positives and false negatives, $\ell_{01}$, for which one decision-making system is preferable over the other. A greater value of the ratio $\ell_{01}$ implies a greater loss from a false positive relative to that from a false negative. Each panel displays the overall and subgroup-specific results for different outcome variables. For each quantity of interest, we show the range of $\ell_{01}$ that corresponds to the preferred decision-making system: human-alone (green lines) or ambiguous (dotted lines). The results suggest that the human-alone system is preferred over the PSA-alone system when the loss from a false positive is about the same as or greater than that from a false negative. The PSA-alone system is never preferred within the specified range of $\ell_{01}$.
  • Figure 4: Optimally Combining Human Decisions with PSA Recommendations when NCA is the outcome. The left plot shows an estimated optimal policy for determining when to provide PSA recommendations to a human judge. The right plot shows an estimated optimal policy regarding when a human decision-maker should follow PSA recommendations. Each shaded area represents the optimal policy for specific combinations of risk scores: light shading indicates a decision rule of "not provide" (left) or "do not follow" (right), while dark shading indicates a decision rule of "provide" (left) or "follow" (right). Unshaded areas represent combinations of risk scores that are not possible. The number of observations for each combination is also shown.
  • Figure 5: Estimated Bounds on the Difference in Classification Ability between Llama3 and Human-alone Decisions. The figure shows the differences in terms of misclassification rate, false negative proportion, and false positive proportion. Each panel presents the overall and subgroup-specific results for one of the three outcome variables. For each quantity of interest, we report estimated bounds (thick lines) and their corresponding 95% confidence interval (thin lines) for the overall sample (red), non-white and white subgroups (blue), and female and male subgroups (green). The results indicate that Llama3 decisions are less accurate than the human judge's decisions in terms of the false positive proportion and the overall misclassification rate.
  • ...and 5 more figures

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2: Asymptotic normality
  • Theorem 3
  • Theorem 4: Asymptotic Normality of Estimated Bounds
  • Theorem 5: Bounding the excess risk
  • Theorem 6: Bounding the excess worst-case risk
  • Theorem S1
  • Theorem S2
  • Lemma S1
  • Lemma S2
  • ...and 2 more