Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

David Herel, Vojtech Bartek, Jiri Jirak, Tomas Mikolov

TL;DR

This paper addresses time-sensitive factual recall in large language models by introducing TimeShift, a benchmark and evaluation framework built on a day-level, paraphrase-rich dataset of over 8,000 events from 2018–2024. The approach pairs each sentence with a set of candidate temporal prefixes, ranks the candidates by the log-probability the model assigns to them, and takes the top-ranked prefix as the model's prediction, so it tests recall of when a fact held rather than recall of static facts. Key findings show that base, non-instruction-tuned models often surpass instruction-tuned and synthetic-trained counterparts, while even large models are brittle under paraphrase, which leaves temporal robustness an open problem. By publicly releasing data, code, and evaluation tools, the work provides a concrete resource for advancing time-aware LLMs in real-world applications such as real-time fact-checking and temporal QA.

Abstract

Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and dataset spanning over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work provides a significant step toward advancing time-aware language models capable of adapting to the dynamic nature of real-world knowledge.

Paper Structure (24 sections, 1 equation, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Temporal log probabilities of sentences predicting the U.S. president (Joe Biden or Donald Trump) using Llama 3.2 3B, showing a clear shift in predictions aligned with their terms. As the model's training data cuts off at the end of 2023, predictions beyond this point reflect extrapolated trends.
  • Figure 2: World map showing the number of news events per country. The US ranks first, with over 3,700 events across the seven years.
  • Figure 3: Distribution of events across categories, showing the highest concentration in Politics & Government and Crime & Law categories.
  • Figure 4: Even distribution of events across years, months, and days, ensuring balanced temporal coverage for evaluation.
  • Figure 5: Schema of the TimeShift algorithm. Nodes represent sentences for which probabilities are computed with varying temporal prefixes (in blue). The sentence with the highest probability is selected as the prediction.
  • ...and 4 more figures
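
The ranking scheme described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function name `rank_temporal_prefixes`, the prefix template `"In {year}, ..."`, and the toy scorer are all hypothetical, and the section does not specify the exact prefix format. In practice the `log_prob` callable would be replaced by a function that sums a language model's token log-probabilities for the full sentence.

```python
from typing import Callable, Dict, Iterable, Tuple


def rank_temporal_prefixes(
    sentence: str,
    years: Iterable[int],
    log_prob: Callable[[str], float],
    template: str = "In {year}, {sentence}",  # assumed prefix format
) -> Tuple[int, Dict[int, float]]:
    """Score the sentence under each temporal prefix and pick the best one."""
    scores = {
        year: log_prob(template.format(year=year, sentence=sentence))
        for year in years
    }
    # The prefix whose full sentence gets the highest log-probability
    # is taken as the model's prediction.
    best_year = max(scores, key=scores.get)
    return best_year, scores


def toy_log_prob(text: str) -> float:
    """Hypothetical stand-in for a model: prefers prefixes close to 2021."""
    year = int(text[3:7])  # relies on the "In YYYY, " template above
    return -float(abs(year - 2021))


best, scores = rank_temporal_prefixes(
    "Joe Biden was inaugurated as US President.",
    range(2018, 2025),
    toy_log_prob,
)
print(best, scores)
```

Passing the scorer as a parameter keeps the ranking logic independent of any particular model library, so the same loop could be reused with day-level prefixes or with paraphrased versions of a sentence.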