TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time

Thales Sales Almeida; Giovana Kerche Bonás; João Guilherme Alves Santos; Hugo Abonizio; Rodrigo Nogueira

TL;DR

TiEBe introduces a large-scale, time- and region-aware benchmark for evaluating factual recall in LLMs using Wikipedia retrospective events across 23 regions and 13 languages over a decade. The methodology combines English-generated QA pairs with native-language translations and uses an LLM-as-Judge to assess answers, enabling cross-linguistic and temporal analysis. Key findings reveal strong regional and language disparities, with recall correlating with socioeconomic indicators, and notable performance gaps for low-resource languages. The work highlights the need for more equitable multilingual training data and broader event sources to improve global factual recall in LLMs.

Abstract

As the knowledge landscape evolves and large language models (LLMs) become increasingly widespread, there is a growing need to keep these models updated with current events. While existing benchmarks assess general factual recall, few studies explore how LLMs retain knowledge over time or across different regions. To address these gaps, we present the Timely Events Benchmark (TiEBe), a dataset of over 23,000 question-answer pairs centered on notable global and regional events, spanning more than 10 years of events, 23 regions, and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to identify notable events through time. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments, grounded in factual evidence beyond Wikipedia itself. Our results reveal significant geographic disparities in factual recall, emphasizing the need for more balanced global representation in LLM training. We also observe a Pearson correlation of more than 0.7 between models' performance in TiEBe and various countries' socioeconomic indicators, such as HDI. In addition, we examine the impact of language on factual recall by posing questions in the native language of the region where each event occurred, uncovering substantial performance gaps for low-resource languages.
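The correlation reported above can be illustrated with a minimal sketch of a Pearson coefficient computed between per-region accuracy and a socioeconomic indicator. The region names and every number below are hypothetical placeholders chosen for illustration, not values from the paper, and the helper `pearson` is an assumption rather than the authors' code.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-region values (NOT taken from the paper).
accuracy_by_region = {"region_a": 0.71, "region_b": 0.55, "region_c": 0.38, "region_d": 0.62}
hdi_by_region = {"region_a": 0.92, "region_b": 0.76, "region_c": 0.58, "region_d": 0.80}

# Align the two series on the same region order before correlating.
regions = sorted(accuracy_by_region)
r = pearson(
    [accuracy_by_region[k] for k in regions],
    [hdi_by_region[k] for k in regions],
)
print(f"Pearson r between accuracy and HDI: {r:.3f}")
```

A coefficient near 1 would mean that regions with higher indicator values also tend to show higher recall; it does not by itself establish a causal link.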

Paper Structure (29 sections, 22 figures, 6 tables)

Introduction
Related work
Methodology
Data Collection
Generation of Question-Answer Pairs
Model Evaluation
LLM-as-Judge Performance
Results
Regional performance
Temporal performance
The effects of model language
Performance Correlation With Socioeconomic Indicators
Conclusion
Limitations and Future work
Execution details
...and 14 more sections

Figures (22)

Figure 1: Illustration of the pipeline used to build TiEBe.
Figure 2: Examples of generated question-answer pairs by country.
Figure 3: Performance of models per country under different subsets of TiEBe.
Figure 4: Accuracy and refusal rates over different time periods.
Figure 5: Difference in overall accuracy when models are prompted in English versus the country's native language. Negative values mean that accuracy was lower in the native language than in English, while positive values indicate that models performed better in the native language.
...and 17 more figures
