Do Language Models Update their Forecasts with New Information?
Zhangdie Yuan, Zifeng Ding, Andreas Vlachos
TL;DR
EvolveCast is a dynamic forecasting framework for assessing how large language models revise probabilistic forecasts when exposed to information released after their training cutoff. By comparing model updates against aggregated human forecasts on Metaculus-derived questions with aligned news, the study shows that LLM updates are conservative and often inconsistent, and that the models remain miscalibrated whether confidence is verbalized or derived from logits. The authors run several ablations, including accumulated news context and direct directional prompting, and find that richer context often fails to improve alignment with the human reference and can introduce noise. Overall, the work highlights open challenges in belief dynamics for LLMs and the need for mechanisms beyond simple retrieval to incorporate external knowledge into probabilistic reasoning. The findings argue for cautious deployment of LLM forecasters and for further research into belief updating and calibration methods.
Abstract
Prior work has largely treated forecasting as a static task, failing to consider how forecasts and the confidence in them should evolve as new evidence emerges. To address this gap, we introduce EvolveCast, a framework for evaluating whether large language models revise their forecasts appropriately in response to new information. In particular, EvolveCast assesses whether LLMs update their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to assess forecast updates and confidence calibration under new information. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative, a tendency that holds across settings and across a variety of LLMs. We further find that both verbalized and logits-based confidence estimates remain far from the human reference standard. These findings suggest that current approaches (e.g., RAG-based methods) for updating model knowledge are insufficient for probabilistic reasoning; models treat new information as retrieval context rather than evidence that shifts posterior probability. EvolveCast thus underscores the need for more robust mechanisms to incorporate external knowledge into belief dynamics.
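
To make the comparison described above concrete, the following is a minimal illustrative sketch of how forecast updates from a model could be compared with a human reference. It is not the EvolveCast implementation: the names `ForecastPair`, `direction_agreement`, `mean_abs_update`, and `brier_score` are hypothetical, the three metrics are common choices assumed here for illustration rather than the paper's reported measures, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class ForecastPair:
    """Probability of 'yes' before and after new information (hypothetical type)."""
    before: float
    after: float

    @property
    def update(self) -> float:
        return self.after - self.before


def _sign(x: float, eps: float = 1e-9) -> int:
    """Return -1, 0, or 1, treating tiny changes as no update."""
    if abs(x) <= eps:
        return 0
    return 1 if x > 0 else -1


def direction_agreement(model: list[ForecastPair], human: list[ForecastPair]) -> float:
    """Fraction of questions where the model moves in the same direction as
    the human reference, counted only over questions where humans moved."""
    moved = [(m, h) for m, h in zip(model, human) if _sign(h.update) != 0]
    if not moved:
        raise ValueError("human reference never updates; agreement is undefined")
    hits = sum(_sign(m.update) == _sign(h.update) for m, h in moved)
    return hits / len(moved)


def mean_abs_update(pairs: list[ForecastPair]) -> float:
    """Average size of the update; smaller values mean more conservative updating."""
    return sum(abs(p.update) for p in pairs) / len(pairs)


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Made-up example: three questions, model versus aggregated human forecasts.
model = [ForecastPair(0.50, 0.55), ForecastPair(0.40, 0.40), ForecastPair(0.70, 0.60)]
human = [ForecastPair(0.50, 0.80), ForecastPair(0.40, 0.20), ForecastPair(0.70, 0.75)]
outcomes = [1, 0, 1]

agreement = direction_agreement(model, human)
model_step = mean_abs_update(model)
human_step = mean_abs_update(human)
model_brier = brier_score([p.after for p in model], outcomes)
```

In this toy data the model moves less than the human reference on average and agrees on direction for only one of the three questions, which is the kind of conservative, inconsistent updating the abstract describes. A real evaluation would also need to align each update with the news that triggered it.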
