FOReCAst: The Future Outcome Reasoning and Confidence Assessment Benchmark

Zhangdie Yuan, Zifeng Ding, Andreas Vlachos

TL;DR

FOReCAst introduces a benchmark for evaluating future outcome reasoning with explicit confidence calibration across Boolean, Timeframe, and Quantity forecasting tasks, using gold confidence derived from crowdsourced forecasts on Metaculus. The framework formalizes predictions as $M(Q) \to (A,C)$ and provides task-specific evaluation metrics including $\text{Brier}$, $\text{ADE}$, and $\text{CRPS}$, with Gaussian uncertainty modeling for calibration. Experiments across a diverse set of LLMs reveal that forecasting remains challenging, with calibration inconsistent and only task-dependent gains from model size or recency; instruction tuning and aggregation strategies can improve certain aspects of calibration. The work highlights the need for dedicated uncertainty modeling in forecasting, demonstrates a scalable data pipeline for continual evaluation, and offers a foundation for future extensions across platforms and languages to support robust, calibrated decision-support systems.
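
As a rough illustration of two of the metrics named above, the sketch below computes a Brier score for a Boolean forecast and the closed-form CRPS of a Gaussian forecast. It is a minimal sketch under assumptions, not the paper's implementation. The function names `brier_score` and `gaussian_crps` are hypothetical. The Brier score is assumed to take the probability assigned to the positive outcome, and the Gaussian is assumed to be parameterized by a mean and standard deviation. The benchmark's exact definitions, including its $\text{ADE}$ metric, may differ.

```python
import math


def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a forecast probability and the 0/1 resolution.

    Assumption: `probability` is the probability assigned to outcome == 1.
    """
    return (probability - outcome) ** 2


def gaussian_crps(mu: float, sigma: float, observed: float) -> float:
    """Closed-form CRPS of a Normal(mu, sigma) forecast against one observation.

    Lower is better; the score is in the same units as `observed`.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (observed - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density at z
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF at z
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


# Hypothetical forecasts, for illustration only.
print(brier_score(0.8, 1))
print(gaussian_crps(mu=0.0, sigma=1.0, observed=0.0))
```

Both scores reward forecasts that are accurate and appropriately confident: for a forecast centered on the true value, a narrower Gaussian yields a lower CRPS, while a narrow Gaussian far from the true value is penalized heavily.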

Abstract

Forecasting is an important task in many domains, such as technology and economics. However, existing forecasting benchmarks largely lack comprehensive confidence assessment, focus on limited question types, and often consist of artificial questions that do not align with real-world human forecasting needs. To address these gaps, we introduce FOReCAst (Future Outcome Reasoning and Confidence Assessment), a benchmark that evaluates models' ability to make predictions and their confidence in them. FOReCAst spans diverse forecasting scenarios involving Boolean questions, timeframe prediction, and quantity estimation, enabling a comprehensive evaluation of both prediction accuracy and confidence calibration for real-world applications.

Paper Structure

This paper contains 35 sections, 8 equations, 4 figures, and 14 tables.

Figures (4)

  • Figure 1: Community prediction trend for a Metaculus question on TikTok’s availability in the US.
  • Figure 2: Histogram of final community forecasts.
  • Figure 3: Community prediction trend for SpaceX Starship's first orbital launch.
  • Figure 4: Probability density function of final community forecasts for SpaceX Starship reaching orbit.