Generative Models, Humans, Predictive Models: Who Is Worse at High-Stakes Decision Making?

Keri Mallari, Julius Adebayo, Kori Inkpen, Martin T. Wells, Albert Gordo, Sarah Tan

TL;DR

The paper interrogates whether large language models can outperform humans or established predictive models in high-stakes recidivism decisions. It jointly analyzes text and multimodal inputs by combining COMPAS data, human judgments, and Chicago Face Database images, employing zero-shot and in-context prompting across multiple models. Key findings show LMs are not superior to humans or COMPAS, but can align with human judgments when provided with additional information; multimodal distractors can unpredictably influence predictions, and bias-mitigation prompts yield inconsistent effects across models. The work offers important cautionary evidence about deploying LMs for risk assessment and contributes to understanding how information modalities and prompting strategies shape decision-making in high-stakes contexts.

Abstract

Despite strong advisory against it, large generative models (LMs) are already being used for decision-making tasks that were previously done by predictive models or humans. We put popular LMs to the test in a high-stakes decision-making task: recidivism prediction. Studying three closed-access and open-source LMs, we analyze the LMs not exclusively in terms of accuracy, but also in terms of agreement with (imperfect, noisy, and sometimes biased) human predictions or existing predictive models. We conduct experiments that assess how providing different types of information, including distractor information such as photos, can influence LM decisions. We also stress test techniques designed to either increase accuracy or mitigate bias in LMs, and find that some have unintended consequences on LM decisions. Our results provide additional quantitative evidence to the wisdom that current LMs are not the right tools for these types of tasks.

Paper Structure

This paper contains 18 sections, 13 figures.

Figures (13)

  • Figure 1: Hypothetical defendant in constructed dataset.
  • Figure 2: Baseline prompt template asking an LLM to predict recidivism given {defendant_description}. {Additional_Info} and {Illegal_Ignore} are italicized as a reminder that they are different for each experiment.
  • Figure 3: Proportion of Positive Outcome (PPO, top row) and accuracy (bottom row) for humans and different predictive and generative models, both for all races and sliced by race. For the PPO plots, the horizontal line denotes the ground truth rate, i.e., any bar above that line indicates over-prediction of positive cases. LLM results are the average of 3 runs, and we report the standard deviation. Llama did not exhibit variation across runs in this experiment, so its standard deviation is zero. Humans and COMPAS Th. predicted the labels only once, so there is no standard deviation to report. Finally, COMPAS Th. allegedly does not use race and is therefore included only in the "no race used" bar.
  • Figure 4: Agreement with humans (top row) and COMPAS Th. (second row), PPO (third row) and accuracy (last row) for the LMs. The horizontal lines denote the human accuracy in the accuracy plots and the ground truth rate in the PPO plots. "+ Race" indicates that race was included in the prompt. "+ Human" and "+ COMPAS" indicate that in-context learning was used to incorporate the decisions of humans and/or the COMPAS model. See the appendix figure labeled fig:other_prompts for details about the exact prompts that were used.
  • Figure 5: Rows 1 and 3: PPO and accuracy for the complete set of experiments, including in-context additional information and multimodal inputs. Rows 2 and 4: Mitigation experiments, where the "Illegal-Ignore" mitigation technique has been applied to the experiments presented in rows 1 and 3. "+ Race" indicates that race was included in the prompt. "+ Human" and "+ COMPAS" indicate that in-context learning was used to incorporate the decisions of humans and/or the COMPAS model. "+ Matched Photo" and "+ Placeholder Photo" indicate that a photo (either matched, or the placeholder template) was provided to the model. See the appendix figure labeled fig:other_prompts for details about the exact prompts that were used.
  • ...and 8 more figures
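
The captions for Figures 3 and 4 rely on three quantities: Proportion of Positive Outcome (PPO), accuracy, and agreement with another predictor. The following is a minimal sketch of how such quantities could be computed for binary predictions, not the authors' code. The function names, the toy labels, and the use of the population standard deviation across runs are all assumptions made for illustration; the page does not say which standard deviation variant the paper reports.

```python
from statistics import mean, pstdev


def _check_lengths(a, b):
    """Raise if two prediction/label sequences cannot be compared."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")


def proportion_positive(preds):
    """PPO: fraction of cases predicted positive (1 = will recidivate)."""
    if not preds:
        raise ValueError("preds must be non-empty")
    return sum(preds) / len(preds)


def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    _check_lengths(preds, labels)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def agreement(preds_a, preds_b):
    """Fraction of cases on which two predictors give the same answer."""
    _check_lengths(preds_a, preds_b)
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Hypothetical toy data, invented for this example only.
labels = [1, 0, 1, 0]            # ground truth
human = [1, 1, 0, 0]             # one round of human predictions
lm_runs = [                      # three runs of the same LM prompt
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]

ground_truth_rate = proportion_positive(labels)
ppo_per_run = [proportion_positive(run) for run in lm_runs]
acc_per_run = [accuracy(run, labels) for run in lm_runs]
agree_per_run = [agreement(run, human) for run in lm_runs]

# Average over runs; pstdev is an assumed choice of spread measure.
ppo_mean, ppo_std = mean(ppo_per_run), pstdev(ppo_per_run)
over_predicts = ppo_mean > ground_truth_rate

print(f"ground truth rate: {ground_truth_rate:.2f}")
print(f"LM PPO: {ppo_mean:.2f} +/- {ppo_std:.2f} (over-predicts: {over_predicts})")
print(f"LM accuracy: {mean(acc_per_run):.2f}")
print(f"LM agreement with humans: {mean(agree_per_run):.2f}")
```

In this sketch, a PPO above the ground truth rate corresponds to a bar above the horizontal line in the PPO plots, and a predictor that is run only once (such as the human baseline here) has no spread to report.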