Do Language Models Update their Forecasts with New Information?

Zhangdie Yuan, Zifeng Ding, Andreas Vlachos

TL;DR

EvolveCast introduces a dynamic forecasting framework to assess how large language models revise probabilistic forecasts when exposed to post-cutoff information. By comparing model updates to aggregated human forecasts on Metaculus-derived questions and aligned news, the study shows that LLMs update conservatively and imperfectly and remain miscalibrated, whether confidence is elicited in verbalized or logit-based form. The authors run several ablations, including accumulated news context and direct directional prompting, and find that richer context often fails to improve alignment and can introduce noise. Overall, the work highlights fundamental challenges in belief dynamics for LLMs and emphasizes the need for mechanisms beyond simple retrieval to incorporate external knowledge into probabilistic reasoning. The framework and findings have implications for responsible forecasting with AI, suggesting cautious deployment and further research into belief updating and calibration methods.

Abstract

Prior work has largely treated forecasting as a static task, failing to consider how forecasts and the confidence in them should evolve as new evidence emerges. To address this gap, we introduce EvolveCast, a framework for evaluating whether large language models revise their forecasts appropriately in response to new information. In particular, EvolveCast assesses whether LLMs update their forecasts when presented with information released after their training cutoff. We use human forecasters as a comparative reference to assess forecast updates and confidence calibration under new information. While LLMs demonstrate some responsiveness to new information, their updates are often inconsistent or overly conservative. We further find that both verbalized and logits-based confidence estimates remain far from the human reference standard. Across settings with a variety of LLMs, models tend to be conservative in updating their forecasts. These findings suggest that current approaches (e.g., RAG-based methods) for updating model knowledge are insufficient for probabilistic reasoning; models treat new information as retrieval context rather than evidence that shifts posterior probability. EvolveCast thus underscores the need for more robust mechanisms to incorporate external knowledge into belief dynamics.

Paper Structure

This paper contains 41 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The EvolveCast framework. Models are evaluated on their ability to update forecasts ($P_0 \rightarrow P_t$) when exposed to new evidence ($x_t$). The dashed arrow indicates an optional setting (Sec. \ref{sec:additional-ablations}) where the model leverages the historical human reference ($r_0$) as context. While the model correctly interprets the news as a positive signal (Direction: Up), it revises its confidence conservatively compared to the human reference, illustrating a magnitude mismatch (Up $+5\%$ vs. Up $+15\%$).
  • Figure 2: Normalized confusion matrices for DeepSeek R1 models under Single (S) and Accumulated (A) news updates. Columns correspond to models; rows correspond to evidence settings. Values are column-normalized, showing $\Pr(\text{pred} \mid \text{true})$ in %.
  • Figure 3: Community prediction trend for a Metaculus question on the US Senate filibuster issue.
  • Figure 4: Histogram of final community forecasts.
  • Figure 5: Delta confusion heatmaps (A$-$S) for DeepSeek R1 models. Each plot shows the difference between column-normalized confusion matrices under Accumulated vs. Single updates, i.e., changes in $\Pr(\text{pred}\mid\text{true})$ (percentage points). Positive values indicate increased mass under Accumulated; negative values indicate decreased mass.
  • ...and 2 more figures
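
The captions for Figures 1, 2, and 5 describe three related quantities: the direction of a forecast update, a column-normalized confusion matrix giving $\Pr(\text{pred} \mid \text{true})$ in percent, and the Accumulated minus Single difference between two such matrices. The sketch below is one way these could be computed, not the authors' implementation. The three-way label set, the `tol` dead-band, the function names, and the toy probabilities are all assumptions made for illustration; the human reference update plays the role of "true" and the model update the role of "pred".

```python
from typing import List, Sequence, Tuple

# Hypothetical three-way direction labels (not taken from the paper).
LABELS = ("down", "same", "up")


def direction(p_before: float, p_after: float, tol: float = 0.01) -> str:
    """Classify an update P_0 -> P_t as 'up', 'down', or 'same'.

    `tol` is an assumed dead-band; the source does not specify one.
    """
    delta = p_after - p_before
    if delta > tol:
        return "up"
    if delta < -tol:
        return "down"
    return "same"


def column_normalized_confusion(
    true: Sequence[str], pred: Sequence[str]
) -> List[List[float]]:
    """Return M where M[i][j] = Pr(pred = LABELS[i] | true = LABELS[j]) in %.

    Rows index the predicted label and columns the true label, so every
    non-empty column sums to 100.
    """
    idx = {label: k for k, label in enumerate(LABELS)}
    n = len(LABELS)
    counts = [[0] * n for _ in range(n)]
    for t, p in zip(true, pred):
        counts[idx[p]][idx[t]] += 1
    col_totals = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    return [
        [100.0 * counts[i][j] / col_totals[j] if col_totals[j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def delta_matrix(
    accumulated: List[List[float]], single: List[List[float]]
) -> List[List[float]]:
    """Element-wise A - S difference, in percentage points."""
    return [[a - s for a, s in zip(row_a, row_s)]
            for row_a, row_s in zip(accumulated, single)]


# Toy (before, after) probabilities; purely illustrative, not paper data.
human: List[Tuple[float, float]] = [(0.40, 0.55), (0.60, 0.50), (0.30, 0.30), (0.20, 0.35)]
model: List[Tuple[float, float]] = [(0.40, 0.45), (0.60, 0.60), (0.30, 0.30), (0.20, 0.15)]

true_dirs = [direction(a, b) for a, b in human]
pred_dirs = [direction(a, b) for a, b in model]
matrix = column_normalized_confusion(true_dirs, pred_dirs)

for label, row in zip(LABELS, matrix):
    print(f"pred={label:<4}", row)
```

In the first toy question the model and the human reference both move up, but by 5 and 15 percentage points respectively, which mirrors the magnitude mismatch in Figure 1: the direction matches while the size of the revision does not. A direction-only confusion matrix cannot show that gap, so magnitude would need to be compared separately.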