AIA Forecaster: Technical Report

Rohan Alur, Bradly C. Stadie, Daniel Kang, Ryan Chen, Matt McManus, Michael Rickert, Tyler Lee, Michael Federici, Richard Zhu, Dennis Fogerty, Hayley Williamson, Nina Lozinski, Aaron Linsky, Jasjeet S. Sekhon

TL;DR

The paper tackles judgmental forecasting with large language models by building the AIA Forecaster, a multi-agent system that conducts agentic search, uses a supervisor to reconcile divergent forecasts, and applies statistical calibration to counteract LLM hedging. It demonstrates expert-level performance on the ForecastBench benchmark and introduces MarketLiquid, a harder live-markets dataset, showing that combining AIA forecasts with market prices can yield additive value. Key contributions include an end-to-end forecasting architecture, a systematic study of search and foreknowledge bias, and evidence that ensembling plus calibration yields robust, scalable forecasting at or beyond human expert levels. The work provides practical guidelines and a state-of-the-art baseline for AI forecasting with transferable implications for policy, economics, and risk assessment.

Abstract

This technical report describes the AIA Forecaster, a Large Language Model (LLM)-based system for judgmental forecasting using unstructured data. The AIA Forecaster approach combines three core elements: agentic search over high-quality news sources, a supervisor agent that reconciles disparate forecasts for the same event, and a set of statistical calibration techniques to counter behavioral biases in large language models. On the ForecastBench benchmark (Karger et al., 2024), the AIA Forecaster achieves performance equal to human superforecasters, surpassing prior LLM baselines. In addition to reporting on ForecastBench, we also introduce a more challenging forecasting benchmark sourced from liquid prediction markets. While the AIA Forecaster underperforms market consensus on this benchmark, an ensemble combining AIA Forecaster with market consensus outperforms consensus alone, demonstrating that our forecaster provides additive information. Our work establishes a new state of the art in AI forecasting and provides practical, transferable recommendations for future research. To the best of our knowledge, this is the first work that verifiably achieves expert-level forecasting at scale.
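As a rough illustration of how a forecaster and a market consensus might be compared and combined, the sketch below computes a Brier score (the mean squared error between forecast probabilities and binary outcomes, the metric named in the figure captions) and a simple weighted blend of two probability sources. The function names, the linear blend, the weight, and the toy numbers are assumptions made for illustration only; the abstract does not specify the report's actual ensembling procedure.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def blend(model_probs: Sequence[float],
          market_probs: Sequence[float],
          weight: float) -> List[float]:
    """Linear blend: weight on the model, (1 - weight) on the market (hypothetical scheme)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * m + (1.0 - weight) * q
            for m, q in zip(model_probs, market_probs)]


if __name__ == "__main__":
    # Toy data, invented for illustration; not results from the report.
    outcomes = [1, 0, 1, 0]
    model = [0.70, 0.40, 0.55, 0.20]
    market = [0.80, 0.30, 0.60, 0.35]

    print("model  :", brier_score(model, outcomes))
    print("market :", brier_score(market, outcomes))
    print("blend  :", brier_score(blend(model, market, weight=0.3), outcomes))
```

In a setting like this, the blend weight would be chosen on held-out questions rather than on the questions being scored, so that any improvement over the market alone is not an artifact of tuning.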

Paper Structure

This paper contains 18 sections, 8 equations, 6 figures, 17 tables.

Figures (6)

  • Figure 1: The architecture of the AIA Forecaster
  • Figure 2: Example of news affecting the market price of a prediction market.
  • Figure 3: The Brier score induced at various ensemble sizes. Point estimates and $95\%$ confidence intervals are generated via bootstrap resampling from a set of 50 forecasts per question. The dashed line indicates the lower confidence bound for a single forecast.
  • Figure 4: Scaling and extremization shift probability mass toward the extremes, especially mass near the center, which produces larger drops in Brier scores.
  • Figure 5: Across probability bins, the largest drops in Brier scores due to correction come from the 0.6-0.8 forecast bin, followed by the 0.2-0.4 forecast bin.
  • ...and 1 more figure
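The extremization mentioned in the Figure 4 caption can be illustrated with one common form of the technique: scaling a forecast in log-odds space by a factor greater than 1, which pushes probabilities away from 0.5 and toward 0 or 1. This is a minimal sketch under stated assumptions, not the report's calibration procedure; the function name `extremize`, the parameter `alpha`, and the clipping constant `eps` are hypothetical choices for illustration.

```python
import math


def extremize(p: float, alpha: float, eps: float = 1e-6) -> float:
    """Scale a probability in log-odds space by alpha (hypothetical parameter).

    alpha > 1 moves p away from 0.5; alpha == 1 leaves p unchanged.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Clip to avoid log(0) or division by zero at exactly 0 or 1.
    p = min(max(p, eps), 1.0 - eps)
    log_odds = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-alpha * log_odds))


if __name__ == "__main__":
    # Hedged forecasts near the center move outward; 0.5 stays fixed.
    for p in (0.3, 0.5, 0.7):
        print(p, "->", round(extremize(p, alpha=2.0), 4))
```

Whether such a transform lowers the Brier score depends on the forecasts being systematically under-confident to begin with; applied to already well-calibrated forecasts it would make them worse.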