Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time
David Herel, Vojtech Bartek, Jiri Jirak, Tomas Mikolov
TL;DR
This paper targets the problem of time-sensitive factual recall in large language models by introducing TimeShift, a benchmark and evaluation framework built on a day-level, paraphrase-rich dataset of over 8,000 events from 2018–2024. The approach assesses models via log-probability ranking over temporal prefixes, so that the evaluation measures whether a model ties a fact to the correct time rather than only whether it knows the fact. Key findings show that base, non-instruction-tuned models often surpass instruction-tuned and synthetic-trained counterparts, while even large models exhibit brittleness under paraphrase, underscoring unresolved challenges in temporal robustness. By publicly releasing data, code, and evaluation tools, the work provides a concrete resource to advance time-aware LLMs for real-world applications like real-time fact-checking and temporal QA.
Abstract
Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, these evaluations often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and a dataset of over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work provides a significant step toward advancing time-aware language models capable of adapting to the dynamic nature of real-world knowledge.
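To make the log-probability ranking over temporal prefixes mentioned in the TL;DR more concrete, the following is a minimal sketch, not the released TimeShift code. It assumes a year-level prefix template ("In {year}, "), and the names rank_years, top1_correct, and toy_log_prob are hypothetical. The toy scorer stands in for a real language model so that the example runs without any model weights.

```python
from typing import Callable, List, Sequence, Tuple


def rank_years(
    event: str,
    years: Sequence[int],
    log_prob: Callable[[str, str], float],
    template: str = "In {year}, ",  # assumed prefix format, not taken from the paper
) -> List[Tuple[int, float]]:
    """Score the event under each temporal prefix, highest log-probability first."""
    scored = [(year, log_prob(template.format(year=year), event)) for year in years]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top1_correct(ranking: List[Tuple[int, float]], gold_year: int) -> bool:
    """True if the highest-scoring prefix names the gold year."""
    return ranking[0][0] == gold_year


def toy_log_prob(prefix: str, event: str) -> float:
    """Hypothetical stand-in for a model: scores fall off with distance from 2021."""
    year = int(prefix.split()[1].rstrip(","))
    return float(-abs(year - 2021))


ranking = rank_years("a placeholder event description.", range(2018, 2025), toy_log_prob)
print(ranking[0][0], top1_correct(ranking, gold_year=2021))
```

With a real model, toy_log_prob would be replaced by a function that returns the summed token log-probabilities of the event text conditioned on the prefix. Running the same ranking over several paraphrases of one event is one way to probe the paraphrase brittleness the abstract describes.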
