Can Language Models Use Forecasting Strategies?

Sarah Pratt, Seth Blumberg, Pietro Kreitlon Carolino, Meredith Ringel Morris

TL;DR

This paper investigates whether large language models can forecast real-world events using a novel dataset (GleanGen) and a Brier-score-based evaluation. It implements several superforecasting-inspired prompting strategies and compares LLM forecasters to human prediction markets. The key finding is that the simplest baseline prompt often matches or exceeds human performance, while more complex prompting strategies do not consistently improve accuracy, likely because the models are biased toward low-probability predictions. To account for dataset imbalance, the authors introduce the Weighted Brier Score and discuss limitations such as training cutoffs and the need for standardized benchmarks and human–LLM collaboration in future forecasting research.

Abstract

Advances in deep learning systems have allowed large models to match or surpass human accuracy on a number of skills such as image classification, basic programming, and standardized test taking. As the performance of the most capable models begins to saturate on tasks where humans already achieve high accuracy, it becomes necessary to benchmark models on increasingly complex abilities. One such task is forecasting the future outcome of events. In this work we describe experiments using a novel dataset of real-world events and associated human predictions, an evaluation metric to measure forecasting ability, and the accuracy of a number of different LLM-based forecasting designs on the provided dataset. Additionally, we analyze the performance of the LLM forecasters against human predictions and find that models still struggle to make accurate predictions about the future. Our follow-up experiments indicate this is likely due to models' tendency to guess that most events are unlikely to occur (which tends to be true for many prediction datasets, but does not reflect actual forecasting abilities). We reflect on next steps for developing a systematic and reliable approach to studying LLM forecasting.
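
The effect described above, where guessing "unlikely" scores well on a dataset in which most events do not occur, can be illustrated with a minimal sketch. The standard Brier score is the mean squared difference between the forecast probability and the 0/1 outcome (lower is better). The class-balanced variant below is only an assumed stand-in for an imbalance-aware metric; it is not the authors' definition of the Weighted Brier Score, and the function names and toy data are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Standard Brier score: mean of (forecast - outcome)^2, lower is better."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def class_balanced_brier(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """ASSUMPTION: one possible imbalance-aware variant, not the paper's metric.

    Averages the squared error separately over events that resolved 'Yes'
    and events that resolved 'No', then averages the two class means so
    each class counts equally regardless of its size.
    """
    yes_errors = [(p - 1) ** 2 for p, o in zip(probs, outcomes) if o == 1]
    no_errors = [p ** 2 for p, o in zip(probs, outcomes) if o == 0]
    if not yes_errors or not no_errors:
        raise ValueError("need at least one event of each outcome")
    yes_mean = sum(yes_errors) / len(yes_errors)
    no_mean = sum(no_errors) / len(no_errors)
    return 0.5 * (yes_mean + no_mean)


# Toy data (hypothetical): 8 of 10 events resolve 'No'.
outcomes = [0] * 8 + [1] * 2
always_low = [0.1] * 10   # forecaster that always says "unlikely"
uninformed = [0.5] * 10   # forecaster that always says "50/50"

for name, probs in [("always_low", always_low), ("uninformed", uninformed)]:
    print(name,
          round(brier_score(probs, outcomes), 3),
          round(class_balanced_brier(probs, outcomes), 3))
```

On this toy data the always-low forecaster beats the 50/50 forecaster under the standard score but loses to it under the class-balanced variant, which is the kind of distortion an imbalance-aware metric is meant to expose.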

Paper Structure

This paper contains 18 sections, 1 equation, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Predictions for events in the Validation set with and without the model first producing a rationale. When prompted to produce a rationale, the model predicts a consistently higher probability. The model's underlying bias toward low probabilities when no rationale is required, paired with a dataset skewed towards events that do not occur, may explain why the simplest baseline outperforms all other strategies. A larger version of this figure appears in the appendix.
  • Figure 2: Example event from GleanGen. Each event contains a description of the event, as well as a specific condition that must be met for the event to resolve as true. Events are binary: the condition is either met by the specified expiration date, or it is not. Additionally, there are human predictions for each event. These predictions come in the form of a range of probabilities over time. From the time the event is created to the expiration date of the event, market participants may update their beliefs based on constantly changing information. Additionally, the human predictions are in the form of a prediction market, meaning that there is a spread of human predictions rather than just a single value. A larger spread can be interpreted as more uncertainty.
  • Figure 3: Breakdown of event types in the Validation set. The majority of events resolve 'No,' meaning that the condition of the event did not take place by the expiration date. The events are distributed over four categories: Technology Industry, Finance, Covid-19, and Misc. Additionally, the number of events that are active on a given date varies significantly, with the peak occurring just before the end of 2022.
  • Figure 4: Example of an unclear model cut-off date. When ChatGPT is first asked its cut-off date, it states it is September 2021. It then will not answer a question about 2022. However, when directly asked a question about 2022 in a new chat window, the model is able to answer correctly. These chats both took place on November 8, 2023. This strongly suggests that the model was in fact trained on data after its stated cut-off date. This uncertainty makes analyzing a model's forecasting ability challenging.
  • Figure 5: Schematic for a single forecasting strategy. Event data is input into the first module, which is instructed to extract two to three keywords for each event (e.g., for the event "Tesla L3 Autonomy (3): Tesla reaches L3 autonomy (driving required only when prompted)", this module extracts the words 'Tesla', 'Autonomy', and 'Driving'). These search terms are then used to retrieve articles from The New York Times and Hacker News APIs. The retrieved articles are then post-processed by an LLM module, which is instructed to remove unrelated headlines and summarize the relevant information returned by the News APIs. Finally, the predictor module uses these summaries as well as the details of the event to make a final prediction. Prompts and pipelines for all forecasting strategies are given in the appendix.
  • ...and 2 more figures
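
The four-stage pipeline described in the Figure 5 caption (keyword extraction, news retrieval, summarization, prediction) can be sketched as plain function composition. This is only an illustrative outline under stated assumptions: the `Event` class, the `forecast` function, and the stub stages are hypothetical names, the stubs stand in for the LLM modules and news APIs rather than calling them, and clamping the result to [0, 1] is an added safeguard not mentioned in the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    """Hypothetical container for one binary event."""
    title: str
    condition: str


def forecast(
    event: Event,
    extract_keywords: Callable[[Event], List[str]],
    search_news: Callable[[str], List[str]],
    summarize: Callable[[Event, List[str]], str],
    predict: Callable[[Event, str], float],
) -> float:
    """Run the four stages in order and return a probability in [0, 1]."""
    keywords = extract_keywords(event)                 # stage 1: 2-3 search terms
    headlines = [h for kw in keywords for h in search_news(kw)]  # stage 2
    summary = summarize(event, headlines)              # stage 3: filter + summarize
    probability = predict(event, summary)              # stage 4: final prediction
    # ASSUMPTION: clamp to a valid probability; not part of the described pipeline.
    return min(1.0, max(0.0, probability))


# Stub stages (placeholders only; a real system would call an LLM and news APIs).
def stub_keywords(event: Event) -> List[str]:
    return ["Tesla", "Autonomy", "Driving"]


def stub_search(keyword: str) -> List[str]:
    return [f"Placeholder headline about {keyword}"]


def stub_summarize(event: Event, headlines: List[str]) -> str:
    return " ".join(headlines)


def stub_predict(event: Event, summary: str) -> float:
    return 0.05  # fixed placeholder value, not a real model output


example = Event(
    title="Tesla L3 Autonomy (3)",
    condition="Tesla reaches L3 autonomy (driving required only when prompted)",
)
demo_probability = forecast(
    example, stub_keywords, stub_search, stub_summarize, stub_predict
)
```

Passing each stage in as a callable keeps the sketch independent of any particular model or news service, so individual stages can be swapped to compare forecasting strategies.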