LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models
Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert, Jakub Kwiatkowski, Wojciech Janowski
TL;DR
LLMLagBench addresses the problem of identifying when an LLM's training data effectively ends, since a model unaware of that limit risks blending outdated knowledge into its current reasoning. The benchmark consists of densely sampled, time-stamped questions about events from 2021–2025. Model answers are scored by a dedicated evaluator, and PELT changepoint detection is applied to the scores to locate training boundaries, accounting for multiple partial cutoffs and for refusals. The study reveals diverse cutoff patterns across models and often finds discrepancies between declared or self-reported cutoffs and the empirically detected boundaries, underscoring the need for independent validation. The work enables practical assessment of knowledge freshness and motivates extensions such as measuring regional knowledge retention and continuous benchmarking of LLMs.
Abstract
Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. This creates a strict knowledge boundary beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks, potentially compromising response accuracy. We introduce LLMLagBench, an LLM freshness benchmark, as a systematic approach for identifying the earliest probable temporal boundaries of an LLM's training data by evaluating its knowledge of recent events. We then apply this benchmark to evaluate a large set of LLMs, including models with both explicitly declared and undeclared training cutoffs. The reliability of the benchmark is assessed by manual validation and comparison with publicly released information about LLM pretraining.
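To illustrate the boundary-detection idea named in the summary, the following is a minimal sketch of PELT changepoint detection applied to a time-ordered series of answer scores. It is not the authors' implementation: the least-squares (mean-shift) cost, the penalty value, the function name `pelt_mean_shift`, and the score series are all assumptions chosen for illustration.

```python
from itertools import accumulate


def pelt_mean_shift(scores, penalty):
    """Return indices where a new segment starts, found by PELT.

    Assumptions (not from the paper): a least-squares cost that detects
    shifts in the mean score, and a fixed penalty per changepoint.
    """
    n = len(scores)
    # Prefix sums let us compute the cost of any segment in O(1).
    s1 = [0.0] + list(accumulate(scores))
    s2 = [0.0] + list(accumulate(x * x for x in scores))

    def cost(a, b):
        # Sum of squared deviations from the mean over scores[a:b].
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] * (n + 1)   # best[t]: optimal penalized cost of scores[:t]
    best[0] = -penalty
    last = [0] * (n + 1)     # last[t]: start of the final segment in that optimum
    candidates = [0]
    for t in range(1, n + 1):
        values = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(values)
        # PELT pruning: discard starts that can no longer be optimal.
        candidates = [s for v, s in values if v - penalty <= best[t]] + [t]

    # Backtrack through the stored segment starts.
    changepoints = []
    t = n
    while last[t] > 0:
        changepoints.append(last[t])
        t = last[t]
    return sorted(changepoints)


# Hypothetical per-period scores: high accuracy, then a sharp drop.
scores = [0.9, 0.8, 0.9, 0.85, 0.9, 0.8, 0.85, 0.9,
          0.2, 0.1, 0.15, 0.1, 0.2, 0.1]
print(pelt_mean_shift(scores, penalty=0.5))
```

In this toy series the detected changepoint is the index of the first period whose scores drop, which is the kind of signal that would be read as a candidate training boundary. The penalty controls sensitivity: a lower value admits more changepoints (for example, several partial cutoffs), while a higher value keeps only the most pronounced shift.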
