Navigating Tomorrow: Reliably Assessing Large Language Models Performance on Future Event Prediction

Petraq Nako, Adam Jatowt

TL;DR

This study probes the ability of large language models to forecast future events using a temporally anchored, news-derived dataset that aligns with model training cutoffs. It systematically compares prompting strategies (Affirmative vs Likelihood), reasoning, and counterfactual perturbations, revealing that probabilistic (Likelihood) prompts generally improve precision while reasoning increases recall at the risk of more false positives. Counterfactual analyses show models are sensitive to small changes in details, highlighting robustness limits for real-world forecasting. By constructing a time-aware dataset and performing comprehensive analyses across entity types and popularity, the work provides practical guidance for deploying LLMs in predictive contexts and outlines clear directions for future improvements.

Abstract

Predicting future events is an important activity with applications across many fields. For example, the capacity to foresee stock market trends, natural disasters, business developments, or political events can facilitate early preventive measures and uncover new opportunities. Diverse computational methods for predicting the future have been proposed, including predictive analysis, time series forecasting, and simulations. This study evaluates the performance of several large language models (LLMs) in supporting future prediction tasks, an under-explored domain. We assess the models across three scenarios: Affirmative vs. Likelihood questioning, Reasoning, and Counterfactual analysis. For this, we create a dataset by finding news articles and categorizing them by entity type and entity popularity. We gather news articles published before and after the LLMs' training cutoff date in order to thoroughly test and compare model performance. Our research highlights LLMs' potential and limitations in predictive modeling, providing a foundation for future improvements.
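The before/after evaluation described above can be illustrated with a minimal sketch. It assumes each article carries a publication date, a ground-truth label for whether the event occurred, and a yes/no model answer; the `Article` class, its field names, and the helper functions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    # All fields are illustrative assumptions, not the paper's schema.
    headline: str
    published: date
    occurred: bool   # ground truth: did the described event happen?
    predicted: bool  # the model's yes/no answer for this event


def split_by_cutoff(articles, cutoff):
    """Partition articles into those published before the cutoff and the rest."""
    before = [a for a in articles if a.published < cutoff]
    after = [a for a in articles if a.published >= cutoff]
    return before, after


def precision_recall(articles):
    """Compute precision and recall of the 'yes' predictions."""
    tp = sum(a.predicted and a.occurred for a in articles)
    fp = sum(a.predicted and not a.occurred for a in articles)
    fn = sum(not a.predicted and a.occurred for a in articles)
    # Return 0.0 when a denominator is empty to avoid division by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Computing the two metrics separately on each partition makes the trade-off visible: more "yes" answers tend to raise recall while adding false positives, which lowers precision.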

Paper Structure

This paper contains 22 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The pipeline of our dataset creation.
  • Figure 2: News article distribution before and after the training cutoff date.
  • Figure 3: Confusion matrices of the Before vs. After categorization based on Llama2 70b, Gemma 7b, and GPT 3.5 Turbo models.
  • Figure 4: Performance comparison of the Popular vs. Unpopular categorization based on Llama2 70b, GPT 3.5 Turbo and Gemma 7b models.