Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy

Philipp Schoenegger, Indre Tuminauskaite, Peter S. Park, Philip E. Tetlock

TL;DR

The paper shows that aggregating forecasts from a diverse ensemble of twelve LLMs can rival human crowd accuracy on real-time binary prediction tasks, a 'wisdom of the silicon crowd' effect obtained with simple median aggregation. Study 1 shows that the LLM crowd beats a 50% no-information baseline and is statistically indistinguishable from the human crowd, although some individual models underperform and calibration biases are evident. Study 2 shows that exposing LLMs to the human median forecast improves accuracy and narrows uncertainty, but the updated forecasts do not surpass a naive average of human and machine forecasts. The findings support practical deployment of LLM ensembles for probabilistic forecasting, while highlighting calibration and bias limitations and suggesting directions for scaling and refining the approach.
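The median aggregation and 50% baseline comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the forecast values are made up, the helper names are hypothetical, and scoring with the Brier score (mean squared error between forecast probability and the 0/1 outcome) is an assumption, since this summary does not name the accuracy metric.

```python
from statistics import median


def median_forecast(forecasts):
    """Aggregate one question's probability forecasts by taking their median."""
    return median(forecasts)


def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical data: each row is one question, each column one model's forecast.
model_forecasts = [
    [0.70, 0.60, 0.90, 0.55],
    [0.20, 0.40, 0.10, 0.65],
    [0.80, 0.75, 0.60, 0.95],
]
outcomes = [1, 0, 1]  # hypothetical resolutions (1 = yes, 0 = no)

# One crowd forecast per question, then score it against the outcomes.
crowd = [median_forecast(row) for row in model_forecasts]
crowd_brier = brier_score(crowd, outcomes)

# A no-information forecaster answers 50% on every question.
baseline_brier = brier_score([0.5] * len(outcomes), outcomes)

print(f"crowd Brier: {crowd_brier:.4f}, baseline Brier: {baseline_brier:.4f}")
```

Lower Brier scores are better, and a constant 50% forecast always scores 0.25, which is why it serves as a convenient no-information reference point. The median is also less sensitive than the mean to a single model giving an extreme forecast.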

Abstract

Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human crowd forecasting tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of twelve LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark and is not statistically different from the human crowd. In exploratory analyses, we find that these two approaches are equivalent with respect to medium-effect-size equivalence bounds. We also observe an acquiescence effect, with mean model predictions being significantly above 50%, despite an almost even split of positive and negative resolutions. Moreover, in Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of human crowd forecasting tournaments via the simple, practically applicable method of forecast aggregation. This replicates the 'wisdom of the crowd' effect for LLMs, and opens up their use for a variety of applications throughout society.
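The Study 2 comparison between a model's updated forecast and a simple human-machine average can be illustrated with a short sketch. Everything here is an assumption made for illustration: the numbers are invented (they are not the paper's data and were chosen only so the ordering mirrors the one described in the abstract), the names are hypothetical, and the Brier score is assumed as the accuracy measure.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def average_forecasts(human, machine):
    """Per-question unweighted mean of a human and a machine forecast."""
    return [(h + m) / 2 for h, m in zip(human, machine)]


# Hypothetical values for four questions (not taken from the paper).
outcomes = [1, 0, 1, 0]
human_median = [0.80, 0.20, 0.70, 0.30]   # human crowd median per question
model_initial = [0.60, 0.55, 0.65, 0.60]  # model forecast before seeing humans
model_updated = [0.70, 0.45, 0.70, 0.50]  # model forecast after seeing humans

averaged = average_forecasts(human_median, model_initial)

scores = {
    "initial": brier_score(model_initial, outcomes),
    "updated": brier_score(model_updated, outcomes),
    "averaged": brier_score(averaged, outcomes),
}

# Relative improvement of the updated forecast over the initial one.
improvement = (scores["initial"] - scores["updated"]) / scores["initial"]

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print(f"relative improvement from updating: {improvement:.0%}")
```

In this toy data the model moves only part of the way toward the human median when it updates, so the mechanical average ends up closer to the better-informed human forecast and scores lower (better).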

Paper Structure (8 sections, 2 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Full prompt for Study 1
  • Figure 2: LLM Ensemble Mechanism Overview
  • Figure 3: Initial prompt for Study 2
  • Figure 4: Prediction intervention prompt for Study 2
  • Figure 5: Scatter Plot of all LLM predictions across all questions
  • ...and 4 more figures