AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

Philipp Schoenegger, Peter S. Park, Ezra Karger, Sean Trott, Philip E. Tetlock

TL;DR

The paper evaluates whether interactive LLM assistants can boost human judgment in forecasting by comparing a superforecasting-prompted LLM, a biased noisy LLM, and a weaker control across six questions (N=991 after exclusions). Using preregistered analyses, it finds that both frontier LLMs improve individual forecasting accuracy relative to the control, with aggregate gains around $24\%$–$28\%$, though differences between the two prompts are not consistently robust. An outlier (Question 3) drives much of the observed pattern, and excluding it reveals that the superforecasting prompt can outperform the noisy prompt and control, while the noisy prompt alone also outperforms the control. The study finds no reliable evidence that LLM augmentation systematically affects crowd wisdom via aggregation or interacts robustly with forecaster skill or question difficulty, highlighting both the promise and fragility of human-AI hybrid forecasting and the need for further robustness and generalization work.

Abstract

Large language models (LLMs) match and sometimes exceed human performance in many domains. This study explores the potential of LLMs to augment human judgement in a forecasting task. We evaluate the effect on human forecasters of two LLM assistants: one designed to provide high-quality ("superforecasting") advice, and the other designed to be overconfident and base-rate neglecting, thus providing noisy forecasting advice. We compare participants using these assistants to a control group that received a less advanced model that did not provide numerical predictions or engage in explicit discussion of predictions. Participants (N = 991) answered a set of six forecasting questions and had the option to consult their assigned LLM assistant throughout. Our preregistered analyses show that interacting with each of our frontier LLM assistants significantly enhances prediction accuracy by between 24 percent and 28 percent compared to the control group. Exploratory analyses showed a pronounced outlier effect in one forecasting item, without which we find that the superforecasting assistant increased accuracy by 41 percent, compared with 29 percent for the noisy assistant. We further examine whether LLM forecasting augmentation disproportionately benefits less skilled forecasters, degrades the wisdom-of-the-crowd by reducing prediction diversity, or varies in effectiveness with question difficulty. Our data do not consistently support these hypotheses. Our results suggest that access to a frontier LLM assistant, even a noisy one, can be a helpful decision aid in cognitively demanding tasks compared to a less powerful model that does not provide specific forecasting advice. However, the effects of outliers suggest that further research into the robustness of this pattern is needed.
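
To make the two quantities in the abstract concrete, here is a minimal Python sketch of how a relative accuracy gain over a control group and a simple wisdom-of-the-crowd aggregate could be computed. The abstract does not state the accuracy metric or the aggregation rule, so everything here is an assumption: accuracy is treated as an error score where lower is better, the crowd forecast is taken to be the median, and the function names and toy numbers are hypothetical and not taken from the paper.

```python
from statistics import mean, median


def relative_improvement(control_errors, treatment_errors):
    """Fractional reduction in mean error relative to the control group.

    Assumes lower error means a more accurate forecast.
    """
    control_mean = mean(control_errors)
    treatment_mean = mean(treatment_errors)
    return (control_mean - treatment_mean) / control_mean


def crowd_error(forecasts, outcome):
    """Absolute error of the median forecast (an assumed crowd aggregate)."""
    return abs(median(forecasts) - outcome)


if __name__ == "__main__":
    # Toy error scores, purely illustrative (not data from the study).
    control = [0.40, 0.50, 0.60]
    treated = [0.30, 0.35, 0.40]
    print(f"relative improvement: {relative_improvement(control, treated):.0%}")

    # Toy probability forecasts for an event that occurred (outcome = 1.0).
    print(f"crowd error: {crowd_error([0.2, 0.6, 0.7], 1.0):.2f}")
```

Under these assumptions, a reported gain of 24 to 28 percent would correspond to `relative_improvement` returning a value between 0.24 and 0.28. The crowd measure is separate from individual accuracy: if an assistant pulls all forecasts toward the same value, individual errors can shrink while the diversity that makes aggregation useful is reduced.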

Paper Structure

This paper contains 6 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Treatment interface.
  • Figure 2: Full prompt for the LLM Augmentation Treatment.
  • Figure 3: Raincloud plot of forecasting accuracy by condition.
  • Figure 4: CDF of forecasting accuracy by condition.
  • Figure 5: Full prompt for the noisy LLM Augmentation Treatment.
  • ...and 2 more figures