Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding

Zhihan Zhang, Yixin Cao, Chenchen Ye, Yunshan Ma, Lizi Liao, Tat-Seng Chua

TL;DR

The paper introduces Temporal Complex Events (TCE) and a large-scale benchmark, TCELongBench, to evaluate LLMs on temporal reasoning and long-context understanding across three QA tasks: detailed retrieval, temporal ordering, and forecasting. It proposes an LLM-based outline extraction pipeline that builds coherent TCE outlines from long, multi-article timelines and a generate-then-verify data construction paradigm to create 88,821 QA pairs from 2,289 TCEs. Through extensive experiments comparing retrieval-augmented generation and long-context LLMs, the study finds that retrievers can match long-context models under suitable configurations, though long-context models excel in temporal sequencing while forecasting remains challenging. The work provides insights into model, retrieval, and prompt strategies for temporal, long-text reasoning and offers TCELongBench as a resource to drive future research in temporal NLP and decision-support systems.

Abstract

The digital landscape is rapidly evolving with an ever-increasing volume of online news, emphasizing the need for swift and precise analysis of complex events. We refer to a complex event composed of many news articles over an extended period as a Temporal Complex Event (TCE). This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within a TCE, characterized by key points and timestamps. We establish a benchmark, named TCELongBench, to evaluate the proficiency of LLMs in handling temporal dynamics and understanding extensive text. This benchmark encompasses three distinct tasks: reading comprehension, temporal sequencing, and future event forecasting. In the experiments, we leverage the retrieval-augmented generation (RAG) method and LLMs with long context windows to deal with the lengthy news articles of a TCE. Our findings indicate that models with suitable retrievers exhibit performance comparable to that of models utilizing long context windows.
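
To make the retrieval-augmented setting more concrete, here is a minimal sketch of how timestamped article passages might be selected and laid out before being passed to a model. It is not the paper's pipeline: the `Chunk` structure, the word-overlap scorer, the prompt layout, and the toy passages are all assumptions chosen for illustration, and a real system would use a proper retriever.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    """A passage from one news article in a TCE (hypothetical structure)."""
    published: date
    text: str


def overlap_score(question: str, text: str) -> int:
    """Count distinct question words that also appear in the passage."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words)


def retrieve(question: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Pick the k best-scoring chunks, then restore chronological order."""
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c.text), reverse=True)
    return sorted(ranked[:k], key=lambda c: c.published)


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Prefix each retrieved chunk with its date, then append the question."""
    context = "\n".join(f"[{c.published.isoformat()}] {c.text}" for c in chunks)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Toy passages, made up for this sketch (not taken from the benchmark).
chunks = [
    Chunk(date(2024, 1, 3), "protests followed the new policy"),
    Chunk(date(2024, 1, 1), "council announces the new policy"),
    Chunk(date(2024, 1, 9), "court schedules a hearing"),
]
question = "what followed the new policy"
selected = retrieve(question, chunks, k=2)
prompt = build_prompt(question, selected)
```

Re-sorting the retrieved passages by publication date keeps the temporal order visible in the prompt, which seems relevant for sequencing-style questions; whether this helps in practice is an empirical question.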

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, and 16 tables.

Figures (8)

  • Figure 1: An example of a temporal complex event (TCE) around the Israeli-Palestinian conflict during December 2017. A TCE consists of many news articles with multiple timestamps. Our work extracts the outline of the TCE.
  • Figure 2: Pipeline of outline extraction and generate-then-verify paradigm.
  • Figure 3: Distributions of day gaps (a) and number of tokens (b). Histograms use the left y-axis; kernel density estimation curves use the right y-axis.
  • Figure 4: Question types in TLB-detail and TLB-forecast.
  • Figure 5: Evaluation pipeline for models using the RAG method and LLMs with long context windows.
  • ...and 3 more figures