TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time

Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos, Hugo Abonizio, Rodrigo Nogueira

TL;DR

TiEBe introduces a large-scale, time- and region-aware benchmark for evaluating factual recall in LLMs using Wikipedia retrospective events across 23 regions and 13 languages over a decade. The methodology combines English-generated QA pairs with native-language translations and uses an LLM-as-Judge to assess answers, enabling cross-linguistic and temporal analysis. Key findings reveal strong regional and language disparities, with recall correlating with socioeconomic indicators, and notable performance gaps for low-resource languages. The work highlights the need for more equitable multilingual training data and broader event sources to improve global factual recall in LLMs.

Abstract

As the knowledge landscape evolves and large language models (LLMs) become increasingly widespread, there is a growing need to keep these models updated with current events. While existing benchmarks assess general factual recall, few studies explore how LLMs retain knowledge over time or across different regions. To address these gaps, we present the Timely Events Benchmark (TiEBe), a dataset of over 23,000 question-answer pairs centered on notable global and regional events, spanning more than 10 years of events, 23 regions, and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to identify notable events through time. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments, grounded in factual evidence beyond Wikipedia itself. Our results reveal significant geographic disparities in factual recall, emphasizing the need for more balanced global representation in LLM training. We also observe a Pearson correlation of more than 0.7 between models' performance in TiEBe and various countries' socioeconomic indicators, such as HDI. In addition, we examine the impact of language on factual recall by posing questions in the native language of the region where each event occurred, uncovering substantial performance gaps for low-resource languages.
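The correlation mentioned above can be illustrated with a short sketch of the Pearson coefficient. This is not the authors' analysis code: the function name `pearson` and the numbers in `accuracy` and `hdi` are hypothetical placeholders chosen only to show the computation, and they are not TiEBe results or real HDI values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-country values, for illustration only (not from the paper).
accuracy = [0.31, 0.42, 0.55, 0.48, 0.70]
hdi = [0.58, 0.66, 0.76, 0.71, 0.92]

r = pearson(accuracy, hdi)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 means that countries with a higher indicator value also tend to have higher model accuracy. It measures linear association only and does not by itself establish a causal link.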

Paper Structure

This paper contains 29 sections, 22 figures, 6 tables.

Figures (22)

  • Figure 1: Illustration of the pipeline used to build TiEBe.
  • Figure 2: Examples of generated question-answer pairs by country.
  • Figure 3: Performance of models per country under different subsets of TiEBe.
  • Figure 4: Accuracy and refusal rates over different time periods.
  • Figure 5: Difference in overall accuracy when models are prompted in English versus the country's native language. Negative values mean that accuracy was lower in the native language than in English, while positive values indicate that models performed better in the native language.
  • ...and 17 more figures