Approaching Human-Level Forecasting with Language Models

Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

TL;DR

The paper introduces a retrieval-augmented forecasting pipeline that uses language models to search for up-to-date information, reason about binary outcomes, and ensemble forecasts. By collecting a large, real-world dataset from five forecasting platforms and evaluating on post-cutoff test questions, the authors demonstrate that their end-to-end system approaches, and in some cases surpasses, the crowd aggregate in predictive accuracy as measured by the Brier score. A key contribution is the self-supervised fine-tuning of a reasoning model, guided by comparisons to crowd performance, alongside a hyperparameter sweep that optimizes retrieval, prompting, and ensembling strategies. The work shows that LM-based forecasting can scale and potentially inform decision-making, and it provides a publicly released dataset to enable further research in automated, calibrated forecasting and its integration with human judgments.

Abstract

Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.
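The TL;DR above measures accuracy with the Brier score and mentions ensembling forecasts. As a minimal sketch of both ideas, assuming binary outcomes coded as 0 or 1: the function names `brier_score` and `aggregate` are hypothetical, and the plain mean used for aggregation is an assumption, since the text here only says that predictions are aggregated.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and binary outcomes.

    Lower is better; always forecasting 0.5 scores 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return mean((f - o) ** 2 for f, o in zip(forecasts, outcomes))


def aggregate(predictions):
    """Combine several predictions for one question.

    Assumption: a plain mean, used here only for illustration.
    """
    return mean(predictions)


if __name__ == "__main__":
    # Three hypothetical model predictions for each of two questions.
    per_question = [[0.6, 0.7, 0.8], [0.1, 0.2, 0.3]]
    final = [aggregate(p) for p in per_question]
    print(brier_score(final, [1, 0]))
```

An uninformed forecaster who always answers 0.5 scores 0.25, which gives a useful reference point when reading the Brier scores reported in the figure captions.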

Paper Structure

This paper contains 78 sections, 1 equation, 21 figures, and 14 tables.

Figures (21)

  • Figure 1: Overview of our retrieval and reasoning systems. Our retrieval system retrieves summarized news articles and feeds them into the reasoning system, which prompts LMs for reasoning and predictions that are aggregated into a final forecast.
  • Figure 2: Our procedure of generating data for self-supervised training. For each question, the method generates multiple candidate reasoning-prediction pairs and selects those that outperform human aggregates for fine-tuning.
  • Figure 3: Our system is naturally well calibrated on both (b) validation and (c) test. The crowd is also well calibrated, consistent with the findings of Zou et al. (2022). In contrast, the base models in the zero-shot setting (a) are less calibrated (see the zero-shot evaluation section).
  • Figure 4: System strengths. Evaluating on the validation set, we note: (a) When provided with enough relevant articles, our system outperforms the crowd. (b) For questions where the crowd is unsure (predictions between $.3$ and $.7$), we outperform them (Brier score $.199$ vs. $.246$). However, the crowd outperforms our system on questions where they are highly confident, e.g., predicting less than $.05$. (c) Our system's Brier score is better at the earlier retrieval dates. Finally, our system is well calibrated (cf. Figure 3(b)).
  • Figure 5: The simple zero-shot prompt used for baseline evaluations. No retrieval is performed. The prompt simply asks the model to make a prediction on a given question from the test set. We add the directive "You MUST ... UNDER ALL CIRCUMSTANCES" to push the model to answer the question, which it sometimes refuses to do, potentially due to safety training. See the zero-shot evaluation section for results and the base-model evaluation section for more details.
  • ...and 16 more figures
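
The selection step described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the `Candidate` type, the `select_for_finetuning` function, and the use of squared error on a resolved binary outcome as the comparison metric are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated reasoning-prediction pair (hypothetical structure)."""
    reasoning: str
    prediction: float  # probability that the question resolves "yes"


def squared_error(probability, outcome):
    """Per-question Brier term for a binary outcome coded as 0 or 1."""
    return (probability - outcome) ** 2


def select_for_finetuning(candidates, crowd_prediction, outcome):
    """Keep only candidates whose error is strictly lower than the crowd's."""
    crowd_error = squared_error(crowd_prediction, outcome)
    return [
        c for c in candidates
        if squared_error(c.prediction, outcome) < crowd_error
    ]


if __name__ == "__main__":
    pool = [
        Candidate("cites recent polling", 0.9),
        Candidate("relies on base rates only", 0.4),
    ]
    # Crowd said 0.7 and the question resolved "yes" (1).
    kept = select_for_finetuning(pool, crowd_prediction=0.7, outcome=1)
    print([c.reasoning for c in kept])
```

In this toy example only the first candidate beats the crowd, so only its reasoning and prediction would enter the fine-tuning set.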