Hybrid Forecasting of Geopolitical Events

Daniel M. Benjamin, Fred Morstatter, Ali E. Abbas, Andres Abeliuk, Pavel Atanasov, Stephen Bennett, Andreas Beger, Saurabh Birari, David V. Budescu, Michele Catasta, Emilio Ferrara, Lucas Haravitch, Mark Himmelstein, KSM Tozammel Hossain, Yuzhong Huang, Woojeong Jin, Regina Joseph, Jure Leskovec, Akira Matsui, Mehrnoosh Mirtaheri, Xiang Ren, Gleb Satyukov, Rajiv Sethi, Amandeep Singh, Rok Sosic, Mark Steyvers, Pedro A Szekely, Michael D. Ward, Aram Galstyan

TL;DR

The paper addresses the accuracy and scalability challenges of forecasting geopolitical events by proposing SAGE, a hybrid platform that couples human judgments with machine-generated forecasts. It evaluates SAGE in a large-scale Hybrid Forecasting Competition, showing that incorporating machine predictions into both the user interface and the aggregation step improves accuracy relative to human-only baselines, especially for skilled forecasters. Key contributions include a robust aggregation framework that weights sources by skill, a data pipeline covering both time-series and non-time-series data, a recommender for individual forecasting problems (IFPs), and training to improve user engagement and trust. The findings demonstrate modest yet reliable gains and support the practical viability of hybrid forecasting as a way to scale predictive effort with limited human resources.

Abstract

Sound decision-making relies on accurate prediction for tangible outcomes ranging from military conflict to disease outbreaks. To improve crowdsourced forecasting accuracy, we developed SAGE, a hybrid forecasting system that combines human- and machine-generated forecasts. The system provides a platform where users can interact with machine models and thus anchor their judgments on an objective benchmark. The system also aggregates human and machine forecasts, weighting both for propinquity and based on assessed skill while adjusting for overconfidence. We present results from the Hybrid Forecasting Competition (HFC), which was larger than comparable forecasting tournaments and included 1085 users forecasting 398 real-world forecasting problems over eight months. Our main result is that the hybrid system generated more accurate forecasts than a human-only baseline that had no machine-generated predictions. We found that skilled forecasters who had access to machine-generated forecasts outperformed those who only viewed historical data. We also demonstrated that including machine-generated forecasts in our aggregation algorithms improved performance, in terms of both accuracy and scalability. This suggests that hybrid forecasting systems, which potentially require fewer human resources, can be a viable approach for maintaining a competitive level of accuracy over a larger number of forecasting questions.
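
The aggregation idea in the abstract (weighting forecasts by recency and assessed skill, then adjusting the pooled probability) can be illustrated with a minimal sketch for a binary question. This is not the SAGE aggregator: the exponential recency decay, the `half_life_days` parameter, the per-forecast `skill` weight, and the power-transform exponent `a` are all assumptions chosen for illustration.

```python
def aggregate(forecasts, half_life_days=7.0, a=1.0):
    """Pool binary-question forecasts into one probability.

    forecasts: list of (probability, age_days, skill) tuples.
    half_life_days, a: hypothetical tuning parameters, not from the paper.
    """
    num = 0.0
    den = 0.0
    for p, age_days, skill in forecasts:
        # Assumed weighting: skill times an exponential recency decay.
        w = skill * 0.5 ** (age_days / half_life_days)
        num += w * p
        den += w
    if den == 0.0:
        raise ValueError("total weight is zero")
    mean = num / den
    # Power transform of the pooled probability: a < 1 pulls it toward 0.5
    # (tempering overconfidence), a > 1 pushes it toward 0 or 1, a = 1 is a no-op.
    return mean ** a / (mean ** a + (1.0 - mean) ** a)


# Two equally skilled forecasts; the week-old one receives half the weight.
pooled = aggregate([(0.8, 0.0, 1.0), (0.4, 7.0, 1.0)])
tempered = aggregate([(0.8, 0.0, 1.0), (0.4, 7.0, 1.0)], a=0.5)
```

In this sketch a machine forecast would simply be another tuple in the list, with its own skill weight.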

Paper Structure

This paper contains 25 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Schematic of SAGE system organized into five topic areas. Platform engineering is in pink, recruitment and retention is in blue, machine-based forecasting is in yellow, human-machine interaction is in green, and diagnostics and feedback is in purple.
  • Figure 2: Screen capture of an IFP with resolution criteria.
  • Figure 3: Schematic illustration of information presented to participants in each experimental condition.
  • Figure 4: Relative performance of two model-based forecasts compared to average human performance and the best human forecast-only aggregation model. Auto ARIMA was a mainstay model throughout; PHE2 emerged later as a top performer. This figure includes performance on 153 IFPs for which all models had forecasts. Red points mark IFPs with known quality issues that were retained for the sake of coverage.
  • Figure 5: Average aggregate performance (Brier score) as a function of the proportion of human forecasts removed from the forecasting pool (Sparsity). Higher Brier scores correspond to worse aggregate accuracy. Each point corresponds to aggregate performance for a random subset of censored forecasts. The line plots the linear regression of these points and the shaded region is the 95% confidence interval based on N=20 simulations.
  • ...and 1 more figure
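
The Brier score and the sparsity manipulation mentioned in the Figure 5 caption can be sketched as follows. This is a hedged illustration only: it uses the common multi-category Brier score (sum of squared errors across answer options) and an unweighted mean as the aggregate, neither of which is confirmed by the text above as the paper's exact choice. The function names are hypothetical.

```python
import random


def brier(probs, outcome_index):
    """Multi-category Brier score: 0 is perfect, higher is worse."""
    return sum(
        (p - (1.0 if k == outcome_index else 0.0)) ** 2
        for k, p in enumerate(probs)
    )


def sparsified_brier(forecasts, outcome_index, sparsity, rng):
    """Score an unweighted-mean aggregate after randomly removing
    a proportion `sparsity` of the forecasts (at least one is kept)."""
    keep = max(1, round(len(forecasts) * (1.0 - sparsity)))
    sample = rng.sample(forecasts, keep)
    n_options = len(sample[0])
    mean = [sum(f[k] for f in sample) / keep for k in range(n_options)]
    return brier(mean, outcome_index)


rng = random.Random(0)
forecasts = [[0.7, 0.3], [0.9, 0.1]]
full_pool_score = sparsified_brier(forecasts, 0, 0.0, rng)
half_pool_score = sparsified_brier(forecasts, 0, 0.5, rng)
```

Repeating the call over many random subsets at each sparsity level would give the kind of point cloud the caption describes.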