A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

He Chang; Chenchen Ye; Zhulin Tao; Jie Wu; Zhengmao Yang; Yunshan Ma; Xianglin Huang; Tat-Seng Chua

TL;DR

This work systematically evaluates large language models on text-involved temporal event forecasting by building MidEast-TE-mini, a benchmark combining structured events with accompanying news texts. It compares graph-only, text-only, and graph-text (mixed) methods, investigating prompt design and LoRA-based fine-tuning, plus retrieval-augmented generation with multiple retrievers and scopes. Key findings show that fine-tuning markedly improves performance, raw text input helps little in zero-shot settings, and retrieval modules can capture temporal patterns yet may introduce noise and popularity bias; complex-event retrieval generally yields the best results. The study highlights practical directions for future research, including larger, higher-quality benchmarks and better alignment of graph and textual representations within LLMs to advance temporal event forecasting in real-world scenarios. In the problem formulation, $G_t$ and $\mathbf{G}_{<t}$ formalize the data, and $(S,O,T)$ represents the forecasting target in the formulated multiple-choice question (MCQ) setup.

Abstract

Recently, Large Language Models (LLMs) have demonstrated great potential in various data mining tasks, such as knowledge question answering, mathematical reasoning, and commonsense reasoning. However, the reasoning capability of LLMs on temporal event forecasting has been under-explored. To systematically investigate their abilities in temporal event forecasting, we conduct a comprehensive evaluation of LLM-based methods for temporal event forecasting. Due to the lack of a high-quality dataset that involves both graph and textual data, we first construct a benchmark dataset, named MidEast-TE-mini. Based on this dataset, we design a series of baseline methods, characterized by various input formats and retrieval augmented generation (RAG) modules. From extensive experiments, we find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance. In contrast, fine-tuning LLMs with raw texts can significantly improve performance. Additionally, LLMs enhanced with retrieval modules can effectively capture temporal relational patterns hidden in historical events. However, issues such as popularity bias and the long-tail problem persist in LLMs, particularly in the retrieval-augmented generation (RAG) method. These findings not only deepen our understanding of LLM-based event forecasting methods but also highlight several promising research directions. We consider that this comprehensive evaluation, along with the identified research opportunities, will significantly contribute to future research on temporal event forecasting through LLMs.
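
To make the graph-only input format concrete, the following is a minimal Python sketch of how a multiple-choice forecasting prompt might be assembled from historical event tuples. It is not the authors' implementation. The `Event` class, the `rule_based_history` and `build_mcq_prompt` helpers, the placeholder actor and relation names, and the choice to ask for the relation between a given subject and object at a given time are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One structured event: (subject, relation, object, day index). Hypothetical schema."""
    subject: str
    relation: str
    obj: str
    day: int


def rule_based_history(events, subject, obj, query_day, max_len=5):
    """Keep the most recent events before query_day that involve the query's subject or object.

    This is one simple, assumed rule; other rules or a learned retriever could replace it.
    """
    actors = (subject, obj)
    past = [
        e for e in events
        if e.day < query_day and (e.subject in actors or e.obj in actors)
    ]
    past.sort(key=lambda e: e.day)
    return past[-max_len:]


def build_mcq_prompt(history, subject, obj, query_day, candidates):
    """Render the history and the candidate relations as a multiple-choice question."""
    lines = ["Historical events (subject, relation, object, day):"]
    lines += [f"({e.subject}, {e.relation}, {e.obj}, {e.day})" for e in history]
    lines.append(
        f"Question: on day {query_day}, which relation will hold "
        f"between {subject} and {obj}?"
    )
    lines += [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(candidates)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


# Made-up placeholder data, not taken from MidEast-TE-mini.
events = [
    Event("Actor_1", "Consult", "Actor_2", 1),
    Event("Actor_2", "Threaten", "Actor_3", 2),
    Event("Actor_3", "Consult", "Actor_4", 3),
    Event("Actor_1", "Criticize", "Actor_2", 4),
]
history = rule_based_history(events, "Actor_1", "Actor_2", query_day=5, max_len=2)
prompt = build_mcq_prompt(
    history, "Actor_1", "Actor_2", 5, ["Consult", "Threaten", "Criticize"]
)
print(prompt)
```

A text-only or mixed variant would, under the same assumptions, append the associated news text to each history line; a retrieval-based variant would replace `rule_based_history` with a query-dependent search over the event graph or the news documents.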

Paper Structure (31 sections, 8 figures, 10 tables)

Preliminary
Problem Formulation
Dataset Construction
Data Source
Construction Pipeline
Dataset Statistics
Methods
Graph-only Methods
Text-only Methods
Graph-and-Text (Mixed) Methods
Prompt Design
Fine-tuning
Experiments
Experimental Settings
Compared Methods
...and 16 more sections

Figures (8)

Figure 1: Illustration of leveraging LLMs for temporal event forecasting. Given the complex event Israeli-Palestinian conflict, three formats of historical event representation, i.e., text (top), graph (bottom), or graph-text (both), can be fed into the LLMs, which are expected to answer input questions about what will happen in the future.
Figure 2: The data distribution of MidEast-TE-m.
Figure 3: Illustration of rule-based history and retrieved history. The rule-based history is constructed by a set of predefined rules. In contrast, the retrieved history dynamically searches context from the temporal knowledge graph or news documents according to the current query.
Figure 4: Performance comparison considering varying historical length for the model of "retrieved history".
Figure 5: The performance comparison on long-term forecasting, where the sub-figure shows the accuracy of the different "retrieved history" models. The horizontal axis represents the time interval between the current timestamp and the query timestamp.
...and 3 more figures
