Approaching Human-Level Forecasting with Language Models
Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt
TL;DR
The paper introduces a retrieval-augmented forecasting pipeline in which language models search for up-to-date information, reason about binary questions, and ensemble the resulting forecasts. Using a large real-world dataset collected from five forecasting platforms, the authors evaluate the end-to-end system on test questions published after the models' knowledge cut-offs. Measured by Brier score, the system nears the crowd aggregate on average and surpasses it in some settings. Key contributions include self-supervised fine-tuning of the reasoning model, guided by comparisons to crowd performance, and a hyperparameter sweep over retrieval, prompting, and ensembling strategies. The work suggests that LM-based forecasting can provide predictions at scale and help inform decision making, and the publicly released dataset supports further research on automated, calibrated forecasting and its integration with human judgment.
Abstract
Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.
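As a rough illustration of the evaluation described above, the sketch below scores binary forecasts with the Brier score (the mean squared difference between the predicted probability and the 0/1 outcome, where lower is better) and compares an ensembled system forecast with a crowd forecast. The function names, the simple-mean aggregation, and all numbers are assumptions made for this example; they are not the authors' code, aggregation rule, or data.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))


def ensemble(probabilities):
    """Aggregate several forecasts for one question.

    Assumption: a plain arithmetic mean, used here only for illustration.
    """
    return mean(probabilities)


# Hypothetical data: three sampled forecasts per question from the system,
# one crowd forecast per question, and the resolved outcome (1 = yes, 0 = no).
system_samples = [[0.7, 0.8, 0.9], [0.2, 0.1, 0.3], [0.6, 0.5, 0.4]]
crowd_forecasts = [0.75, 0.20, 0.60]
outcomes = [1, 0, 1]

system_forecasts = [ensemble(samples) for samples in system_samples]

print("system Brier score:", brier_score(system_forecasts, outcomes))
print("crowd Brier score: ", brier_score(crowd_forecasts, outcomes))
```

For reference, a forecaster who always answers 0.5 scores 0.25 on any set of binary questions, so a useful system needs to land below that baseline.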
