ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Ezra Karger; Houtan Bastani; Chen Yueh-Han; Zachary Jacobs; Danny Halawi; Fred Zhang; Philip E. Tetlock

TL;DR

ForecastBench is a dynamic benchmark designed to avoid data leakage: it maintains a continuously updated set of 1,000 forecasting questions about unresolved future events and uses them to evaluate AI forecasting capabilities. It combines automated question generation with nightly resolution updates, a public leaderboard, and four expanding datasets (general public, superforecasters, LLMs, and question resolutions) to enable comparisons across models and against humans. Initial results show that state-of-the-art LLMs, even when augmented with retrieval and crowd forecasts, lag behind expert superforecasters on a random 200-question subset answered by humans, with a statistically significant gap ($p$-value $< 0.001$). The framework and auxiliary datasets aim to accelerate progress in AI forecasting and support robust, real-world decision-making, with future work focusing on model adaptation and more sophisticated reasoning over time-series and cross-domain data.

Abstract

Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench consists solely of questions about future events that have no known answer at the time of submission. We quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark ($N=200$). While LLMs have achieved super-human performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM ($p$-value $<0.001$). We display system and human scores in a public leaderboard at www.forecastbench.org.

Paper Structure (87 sections, 10 figures, 26 tables)

Introduction
Related work
Automated forecasting
Language model evaluation
Preliminaries
Forecasting
Metrics
Models
Benchmark, leaderboard, and datasets
Question bank
Questions and resolution values
Markets
Datasets
Question Bank
Question metadata
...and 72 more sections

Figures (10)

Figure 1: The graphs show the linear relationship between the Brier scores from Table \ref{table:combined_leaderboard_panel_a} and (a) Chatbot Arena scores and (b) estimates of training compute. The dotted blue line represents the Superforecasters' overall Brier score. A red dot with a bootstrapped 95% confidence interval marks where this dotted blue line meets the dashed linear fit line, indicating the Arena score or training compute at which LLMs could potentially reach Superforecaster-level forecasting performance. For (b), if estimates from EpochNotableModels2024 were not available, we produced estimates following https://epoch.ai/blog/estimating-training-compute. The trend line in (a) is $y = 0.506 - 0.000298x$ ($R^2 = 0.47$) and in (b) it is $y = 0.844 - 0.01213x$ ($R^2 = 0.41$).
Figure 2: An example market-based question from the human survey.
Figure 3: An example question generated from a data provider, in this case DBnomics, from the public survey. Two of eight forecast horizons for which we elicited forecasts are included above. The rationale text boxes (one for each forecast horizon) have also been excluded from the screenshot for brevity.
Figure 4: Zero-shot Prompt from halawi2024approaching
Figure 5: Scratchpad Prompt modified from halawi2024approaching
...and 5 more figures
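
The Figure 1 caption relies on two simple calculations: a Brier score and the point where a fitted line crosses a target score. The sketch below illustrates both under stated assumptions. It uses the standard definition of the Brier score for binary outcomes (mean squared difference between forecast probability and outcome) and the panel (a) coefficients quoted in the caption. The function names are hypothetical, and `HYPOTHETICAL_TARGET` is a placeholder value chosen only for illustration; it is not the Superforecasters' score, which this excerpt does not state.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes.

    Lower is better; 0.0 is a perfect score.
    """
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("need equal-length, non-empty inputs")
    total = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes))
    return total / len(forecasts)


def trend_intersection(intercept, slope, target_brier):
    """Solve intercept + slope * x = target_brier for x."""
    if slope == 0:
        raise ValueError("a flat trend line never crosses the target")
    return (target_brier - intercept) / slope


# Panel (a) fit from the caption: y = 0.506 - 0.000298x,
# where x is the Chatbot Arena score and y is the Brier score.
INTERCEPT_A = 0.506
SLOPE_A = -0.000298

# Placeholder target for illustration only (an assumption, not a reported value).
HYPOTHETICAL_TARGET = 0.1

arena_score_needed = trend_intersection(INTERCEPT_A, SLOPE_A, HYPOTHETICAL_TARGET)

# A forecaster who always answers 0.5 scores 0.25 regardless of outcomes.
uninformed = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])
```

Substituting the actual Superforecaster Brier score for the placeholder would reproduce the x-coordinate of the red dot in panel (a); the caption's confidence interval comes from bootstrapping, which this sketch does not attempt.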
