Navigating Tomorrow: Reliably Assessing Large Language Models Performance on Future Event Prediction

Petraq Nako; Adam Jatowt

TL;DR

This study probes the ability of large language models to forecast future events using a temporally anchored, news-derived dataset that aligns with model training cutoffs. It systematically compares prompting strategies (Affirmative vs. Likelihood), reasoning, and counterfactual perturbations, revealing that probabilistic (Likelihood) prompts generally improve precision, while reasoning increases recall at the risk of more false positives. Counterfactual analyses show that models are sensitive to small changes in details, highlighting robustness limits for real-world forecasting. By constructing a time-aware dataset and analyzing results across entity types and popularity levels, the work provides practical guidance for deploying LLMs in predictive contexts and outlines directions for future improvements.

Abstract

Predicting future events is an important activity with applications across multiple fields and domains. For example, the capacity to foresee stock market trends, natural disasters, business developments, or political events can facilitate early preventive measures and uncover new opportunities. Diverse computational methods for attempting future predictions have been proposed, including predictive analysis, time series forecasting, and simulations. This study evaluates the performance of several large language models (LLMs) in supporting future prediction tasks, an under-explored domain. We assess the models across three scenarios: Affirmative vs. Likelihood questioning, Reasoning, and Counterfactual analysis. To this end, we create a dataset by collecting news articles and categorizing them by entity type and entity popularity. We gather news articles from before and after the LLMs' training cutoff dates in order to thoroughly test and compare model performance. Our research highlights LLMs' potential and limitations in predictive modeling, providing a foundation for future improvements.
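The before/after comparison described in the abstract can be illustrated with a minimal sketch. It assumes each question carries an article date, a ground-truth label, and a yes/no model answer. The names `Example`, `split_by_cutoff`, and `precision_recall` are hypothetical and are not taken from the paper's code, and the cutoff date is supplied by the caller rather than drawn from any specific model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Example:
    # All field names are illustrative assumptions, not the paper's schema.
    event_date: date   # date of the news article behind the question
    label: bool        # True for a real event, False for a negative instance
    prediction: bool   # the model's yes/no answer


def split_by_cutoff(examples, cutoff):
    """Partition examples into those dated before the cutoff and those on or after it."""
    before = [e for e in examples if e.event_date < cutoff]
    after = [e for e in examples if e.event_date >= cutoff]
    return before, after


def precision_recall(examples):
    """Return (precision, recall); a zero denominator yields 0.0 for that metric."""
    tp = sum(1 for e in examples if e.prediction and e.label)
    fp = sum(1 for e in examples if e.prediction and not e.label)
    fn = sum(1 for e in examples if not e.prediction and e.label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Computing the two metrics separately for each side of the cutoff makes it possible to see whether a prompting strategy trades precision for recall differently on events the model may have seen during training versus events that occurred afterwards.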

Paper Structure (22 sections, 4 figures, 4 tables)

This paper contains 22 sections, 4 figures, 4 tables.

Introduction
Related Work
Dataset
Entity Gathering
Determining Entity Popularity
Event Collection
Negative Instances
Question Generation
LLM Forecasting Analysis
Large Language Models
LLMs Question-Answering
Data Analysis Techniques
Findings and Discussion
Findings
Affirmative vs Likelihood Analysis
...and 7 more sections

Figures (4)

Figure 1: The pipeline of our dataset creation.
Figure 2: News article distribution before and after cut-off training date.
Figure 3: Confusion matrices of the Before vs. After categorization based on Llama2 70b, Gemma 7b, and GPT 3.5 Turbo models.
Figure 4: Performance comparison of the Popular vs. Unpopular categorization based on Llama2 70b, GPT 3.5 Turbo and Gemma 7b models.
