Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Bahare Fatemi; Mehran Kazemi; Anton Tsitsulin; Karishma Malkan; Jinyeong Yim; John Palowitch; Sungyong Seo; Jonathan Halcrow; Bryan Perozzi

TL;DR

This paper introduces ToT (Test of Time), a benchmark designed to rigorously evaluate LLM temporal reasoning by decoupling semantic/temporal logic from temporal arithmetic. It presents two synthetic datasets, ToT-Semantic and ToT-Arithmetic, to isolate reasoning from prior knowledge and enable controlled analyses of problem structure, size, question type, and fact ordering. Through experiments with Claude-3-Sonnet, GPT-4, and Gemini 1.5 Pro, the authors show that graph structure and prompt organization significantly affect performance, with timeline and duration tasks revealing notable weaknesses. By open-sourcing the datasets and framework, the work aims to standardize evaluation and spur further research into robust temporal reasoning in LLMs.

Abstract

Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.
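To make the idea of a synthetic, anonymized temporal dataset concrete, here is a minimal sketch in Python. It is not the authors' generation pipeline: the `Fact` class, the entity and relation names (`E0`, `R0`, ...), the year ranges, and the three ordering modes are all hypothetical choices made for illustration. It shows how facts with no real-world meaning can be generated, how the same fact set can be presented in different orders, and how a ground-truth answer follows from the structure alone.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One temporal fact: `subject` held `relation` to `obj` during [start, end]."""
    subject: str
    relation: str
    obj: str
    start: int  # first year the fact holds
    end: int    # last year the fact holds (inclusive)

    def to_text(self) -> str:
        return (f"{self.subject} was the {self.relation} of {self.obj} "
                f"from {self.start} to {self.end}.")


def make_facts(num_entities: int, num_facts: int, seed: int = 0) -> list[Fact]:
    """Generate random facts over anonymized entities (hypothetical scheme)."""
    rng = random.Random(seed)
    entities = [f"E{i}" for i in range(num_entities)]
    relations = [f"R{i}" for i in range(3)]
    facts = []
    for _ in range(num_facts):
        subject, obj = rng.sample(entities, 2)  # two distinct entities
        start = rng.randint(1900, 2000)
        end = start + rng.randint(0, 20)
        facts.append(Fact(subject, rng.choice(relations), obj, start, end))
    return facts


def order_facts(facts: list[Fact], order: str, seed: int = 0) -> list[Fact]:
    """Return the same facts in a different presentation order."""
    if order == "shuffle":
        shuffled = list(facts)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if order == "start_time":
        return sorted(facts, key=lambda f: f.start)
    if order == "subject":
        return sorted(facts, key=lambda f: (f.subject, f.start))
    raise ValueError(f"unknown order: {order}")


def holders_at(facts: list[Fact], relation: str, obj: str, year: int) -> set[str]:
    """Ground truth for: which entities held `relation` to `obj` in `year`?"""
    return {f.subject for f in facts
            if f.relation == relation and f.obj == obj and f.start <= year <= f.end}


if __name__ == "__main__":
    facts = make_facts(num_entities=5, num_facts=8, seed=1)
    for fact in order_facts(facts, "start_time"):
        print(fact.to_text())
```

Because the ground truth is computed from the facts themselves, it does not change with presentation order, so any accuracy difference between orderings can be attributed to the prompt layout rather than to the underlying problem.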

Paper Structure (39 sections, 5 figures, 12 tables)

Introduction
Related work
ToT: A benchmark for evaluating LLMs on temporal reasoning
ToT-Semantic: A Synthetic Dataset
ToT-Arithmetic: A Temporal Arithmetic Dataset
Task Creation
Quality Check
Experiments and Results
Investigating the impact of temporal structure on LLM temporal reasoning
Influence of graph size on LLM performance
Effects of temporal question type on LLM temporal reasoning
Impact of temporal fact order on LLM performance
Temporal semantics vs temporal arithmetic
Conclusion
Acknowledgement
...and 24 more sections

Figures (5)

Figure 1: Comparison of the same temporal query using real (left) and anonymized (right) entity names. Gemini Advanced correctly answered the query with real names but failed with anonymized names, suggesting that LLMs might rely on their parametric knowledge to solve temporal tasks.
Figure 2: Steps for creating the ToT-Semantic dataset.
Figure 3: Steps for creating the ToT-Arithmetic dataset. The green and blue colors represent the operations done by the authors and the annotators respectively.
Figure 4: Accuracy of models for different numbers of edges and nodes.
Figure 5: A visualization of a representative graph from each graph generator: Erdős-Rényi (ER), Scale-Free Networks (SFN), Barabási–Albert (BA), Stochastic Block Model (SBM), star-graph, and complete-graph.
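
For readers unfamiliar with the graph families named in Figure 5, the following sketch builds edge lists for three of them (complete, star, and Erdős-Rényi) using only the Python standard library. The function names and parameters are illustrative assumptions, not the paper's code, and the Erdős-Rényi variant shown is the undirected G(n, p) model, in which each possible edge is included independently with probability p.

```python
import itertools
import random

Edge = tuple[int, int]


def complete_graph(n: int) -> list[Edge]:
    """Every pair of distinct nodes is connected."""
    return list(itertools.combinations(range(n), 2))


def star_graph(n: int) -> list[Edge]:
    """Node 0 is the hub; every other node connects only to it."""
    return [(0, i) for i in range(1, n)]


def erdos_renyi(n: int, p: float, seed: int = 0) -> list[Edge]:
    """Undirected G(n, p): keep each possible edge independently with probability p."""
    rng = random.Random(seed)
    return [edge for edge in itertools.combinations(range(n), 2) if rng.random() < p]


if __name__ == "__main__":
    print(len(complete_graph(6)), len(star_graph(6)), len(erdos_renyi(6, 0.3)))
```

Each edge can then be annotated with a relation and a time interval to obtain temporal facts, which is one way that graph structure can be varied independently of the question being asked.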
