Time Machine GPT

Felix Drinkall, Eghbal Rahimikia, Janet B. Pierrehumbert, Stefan Zohren

TL;DR

TiMaGPT proposes a series of temporally constrained language models trained exclusively on historical data up to fixed cutoffs to prevent look-ahead bias in dynamic tasks. By pretraining yearly models on strictly pre-cutoff data and releasing accompanying datasets, the approach enables rigorous diachronic analysis and fair evaluation of language models across time, while revealing foresight leakage in conventional temporal adaptation methods. The authors detail data pipelines (Wikipedia year partitions and WMT News), sampling strategies that bias toward recent content, and a 2.5B-token, GPT-2 small pretraining regime that yields stable static-benchmark performance with a distinct temporal signature. This work provides a practical framework for studying language evolution and for deploying temporally faithful models in time-series forecasting and other dynamic settings.

Abstract

Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets.
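
The point-in-time constraint and the recency-biased sampling summarized above can be illustrated with a small sketch. It is not the authors' pipeline: the function names, the `(text, publication_date)` document format, and the exponential half-life weighting are all assumptions chosen for illustration, and the source does not specify the actual sampling scheme at this level of detail.

```python
import random
from datetime import date


def recency_weight(pub_date, cutoff, half_life_days=365.0):
    """Hypothetical weighting: halves for every `half_life_days` of age."""
    age_days = (cutoff - pub_date).days
    return 0.5 ** (age_days / half_life_days)


def build_point_in_time_corpus(docs, cutoff, n_samples,
                               half_life_days=365.0, seed=0):
    """Sample documents for a model with the given cutoff date.

    docs: list of (text, publication_date) tuples (assumed format).
    Only documents published strictly before `cutoff` are eligible, so
    nothing from the "future" can leak into training. Sampling is with
    replacement and favours documents closer to the cutoff.
    """
    eligible = [(text, d) for text, d in docs if d < cutoff]
    if not eligible:
        return []
    weights = [recency_weight(d, cutoff, half_life_days) for _, d in eligible]
    rng = random.Random(seed)
    return rng.choices(eligible, weights=weights, k=n_samples)


if __name__ == "__main__":
    docs = [
        ("older article", date(2018, 1, 1)),
        ("recent article", date(2019, 6, 1)),
        ("post-cutoff article", date(2020, 3, 1)),
    ]
    sample = build_point_in_time_corpus(docs, date(2020, 1, 1), n_samples=5)
    for text, published in sample:
        print(published.isoformat(), text)
```

The hard date filter is what keeps the model nonprognosticative; the weighting only changes how the eligible pre-cutoff documents are mixed.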

Paper Structure (29 sections, 5 equations, 4 figures, 3 tables)

This paper contains 29 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The perplexity of coronavirus and COVID-19, using TiMaGPT models (•) and Conventional Temporally Adapted (CTA) models ($\times$). The perplexity calculation is outlined in the paper's appendix, and the methodology for temporally adapting models is explained in the evaluation section. The CTA models have significant knowledge of these words before the pandemic.
  • Figure 2: Average perplexity of the names of country leaders around their year of inauguration, as measured using CTA models (Section 2.2) and TiMaGPT models.
  • Figure 3: Histogram of the publication dates of the Wikipedia article revisions in the 2020 training dataset.
  • Figure 4: Number of occurrences of the words coronavirus and COVID-19 in the training datasets.
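
Figures 1 and 2 report the perplexity of individual words and names. The paper's exact calculation is given in its appendix and is not reproduced here; the sketch below uses the standard definition (the exponential of the mean negative log-likelihood) and assumes that the natural-log probabilities of the target word's subword tokens have already been obtained from a model. The function name and input format are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Standard perplexity from per-token natural-log probabilities.

    token_log_probs: log p(token | context) for each subword token of the
    target word (assumed to come from a language model, not computed here).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A word split into two tokens, each assigned probability 0.5.
    print(perplexity([math.log(0.5), math.log(0.5)]))
```

Under this definition, a model that has never seen a word such as COVID-19 assigns its tokens low probability and therefore a high perplexity, which is the signal the figures compare across TiMaGPT and CTA models.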