Dated Data: Tracing Knowledge Cutoffs in Large Language Models

Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, Benjamin Van Durme

TL;DR

The paper analyzes the gap between the knowledge cutoffs claimed for large language models and the actual temporal alignment of their training data. It introduces the concept of an effective cutoff, estimated by probing perplexity across dated versions of time-spanning resources (WikiSpan and NewsSpan), and demonstrates that effective cutoffs often diverge from reported dates because of deduplication issues and temporal biases in CommonCrawl data. Experiments on open models pretrained on C4-, Pile-, and RefinedWeb-based data show varying degrees of alignment, with notable misalignments in models not derived from the Pile, caused by near-duplicates and by old data carried into newer web dumps. The work highlights practical implications for dataset curation and for how end users interpret model knowledge, and suggests paths to improve the provenance and alignment of knowledge cutoffs.

Abstract

Released Large Language Models (LLMs) are often paired with a claimed knowledge cutoff date, or the date at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information. However, this statement only scratches the surface: do all resources in the training data share the same knowledge cutoff date? Does the model's demonstrated knowledge for these subsets closely align to their cutoff dates? In this work, we define the notion of an effective cutoff. This is distinct from the cutoff reported by the LLM designer and applies separately to sub-resources and topics. We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM by probing across versions of the data. Using this analysis, we find that effective cutoffs often differ from reported cutoffs. To understand the root cause of this observation, we conduct a direct large-scale analysis on open pre-training datasets. Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates. Overall, our results show that knowledge cutoffs are not as simple as they have seemed and that care must be taken both by LLM dataset curators and by practitioners who seek to use information from these models.
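
The probing idea in the abstract can be illustrated with a minimal sketch: score each dated version of a document by perplexity, rescale each document's curve so documents are comparable, average across documents, and take the month where the averaged curve is lowest. The function names, the mean-based rescaling, and the toy numbers below are assumptions made for illustration; they are not taken from the paper, which may normalize and aggregate differently.

```python
import math

def perplexity(token_logprobs):
    """Perplexity as exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_curve(ppl_by_month):
    """Rescale one document's perplexity curve so its mean is 1 (assumed normalization)."""
    mean = sum(ppl_by_month.values()) / len(ppl_by_month)
    return {month: ppl / mean for month, ppl in ppl_by_month.items()}

def effective_cutoff(docs):
    """docs maps doc_id -> {month: perplexity}.

    Returns the month whose average relative perplexity is lowest.
    """
    totals, counts = {}, {}
    for curve in docs.values():
        for month, rel in relative_curve(curve).items():
            totals[month] = totals.get(month, 0.0) + rel
            counts[month] = counts.get(month, 0) + 1
    averaged = {month: totals[month] / counts[month] for month in totals}
    return min(averaged, key=averaged.get)

# Toy perplexities, invented for illustration only.
docs = {
    "doc_a": {"2019-01": 12.0, "2020-01": 9.0, "2021-01": 11.0},
    "doc_b": {"2019-01": 30.0, "2020-01": 24.0, "2021-01": 33.0},
}
print(effective_cutoff(docs))
```

In practice the per-version perplexities would come from running the model under study over each dated snapshot; only the aggregation step is sketched here.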

Paper Structure (45 sections, 10 figures, 4 tables, 2 algorithms)

Figures (10)

  • Figure 1: LLMs may contain different versions of a dataset in their training data than what is specified in a "cutoff" date, misleading users and causing potential errors.
  • Figure 2: Perplexity of the Wiki document "Liverpool" under Pythia. Each point is the perplexity of the document at that time.
  • Figure 3: Relative perplexities of models per month using the NewsSpan dataset (see the section on time-spanning datasets); we use relative values because exact perplexities are not needed for determining effective cutoffs. We find that our approach identifies the effective cutoffs as the stated knowledge cutoff for NYT, as models have increased perplexity when their CommonCrawl data dumps end in 2020.
  • Figure 4: Relative perplexities of models per month using the WikiSpan dataset (see the section on time-spanning datasets). The upper plot shows Pile-derived models, the middle plot shows FalconRW-derived models, and the lower plot shows C4-derived models. The light grey bars indicate the distribution of Wikipedia-alike documents, matched to their closest version, as calculated in the mining section. In some cases the knowledge cutoff aligns with the model's effective cutoff (e.g. the Pile), while more recent models are aligned much earlier (e.g. RedPajama to 2019, even though it has an explicit 2023 Wikipedia dump).
  • Figure 5: Relative perplexities of models in the Pythia (left) and LLaMA (right) suites. Darker colors indicate larger model size. While the smaller models have a more variable perplexity curve, they are still minimized at the same effective cutoff date.
  • ...and 5 more figures
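
The grey bars described for Figure 4 count crawled documents by the dated version they most resemble. A rough sketch of that kind of matching follows, assuming a character-level similarity from Python's standard `difflib`; the paper's actual matching procedure is not given in this summary, and the helper names and toy strings are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def closest_version(document, versions):
    """versions maps date -> text. Return the date of the most similar version."""
    return max(
        versions,
        key=lambda date: SequenceMatcher(None, document, versions[date]).ratio(),
    )

def version_histogram(documents, versions):
    """Count how many documents are closest to each dated version."""
    return Counter(closest_version(doc, versions) for doc in documents)

# Toy snapshots and crawled copies, invented for illustration only.
versions = {
    "2019-06": "Example Town is a town with 100 residents and one library.",
    "2021-06": "Example Town is a town with 250 residents, two libraries, and a new museum.",
}
crawled = [
    "Example Town is a town with 100 residents and one library.",
    "Example Town is a town with 100 residents and 1 library.",
    "Example Town is a town with 250 residents, two libraries, and a new museum.",
]
hist = version_histogram(crawled, versions)
print(hist)
```

A histogram that peaks well before the newest snapshot would be consistent with the caption's point that stale copies of a resource can outnumber its latest version in a crawl.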