Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking
Chinmay Karkar, Paras Chopra
TL;DR
The paper investigates how forecasting performance by LLMs varies with domain structure and prompt framing, focusing on events beyond the models' training cutoffs. It introduces a data-processing pipeline to curate a balanced real-world forecasting benchmark from prediction markets and evaluates multiple model families with and without external news context, using metrics such as accuracy, the Brier score, and the expected calibration error (ECE). Key findings show substantial variability in forecasting ability across domains and prompt setups, with news context sometimes improving and other times degrading performance due to issues like recency bias, rumor overweighting, and definition drift. The work argues for benchmark designs that disentangle knowledge recall from probabilistic inference and highlights the conditional nature of future-reasoning ability in LLMs.
Abstract
Large Language Models (LLMs) demonstrate partial forecasting competence across social, political, and economic events. Yet their predictive ability varies sharply with domain structure and prompt framing. We investigate how forecasting performance varies across model families on real-world questions about events that occurred after the models' training cutoff dates. We analyze how context, question type, and external knowledge affect accuracy and calibration, and how adding factual news context modifies belief formation and failure modes. Our results show that forecasting ability is highly variable: it depends on what, and how, we ask.
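To make the accuracy and calibration measures concrete, the following is a minimal sketch of how binary forecasts are commonly scored with accuracy, the Brier score, and a binned expected calibration error. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, the choice of 10 equal-width bins, and the example numbers are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(probs: Sequence[float], outcomes: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of questions where the thresholded forecast matches the outcome.

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs: Sequence[float], outcomes: Sequence[int],
                               n_bins: int = 10) -> float:
    """Binned ECE on the predicted probability of the 'yes' outcome.

    Forecasts are grouped into equal-width probability bins; each bin
    contributes |mean forecast - observed frequency|, weighted by its share
    of the questions. The bin count is an illustrative assumption.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_forecast - observed_freq)
    return ece


# Hypothetical forecasts for four resolved yes/no questions.
example_probs = [0.9, 0.8, 0.3, 0.6]
example_outcomes = [1, 1, 0, 0]
scores = {
    "accuracy": accuracy(example_probs, example_outcomes),
    "brier": brier_score(example_probs, example_outcomes),
    "ece": expected_calibration_error(example_probs, example_outcomes),
}
```

Accuracy rewards only the side of the threshold a forecast falls on, whereas the Brier score and ECE penalize over- or under-confident probabilities, which is why a model can look similar on accuracy yet differ on calibration.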
