Set the Clock: Temporal Alignment of Pretrained Language Models

Bowen Zhao, Zander Brumbaugh, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith

TL;DR

The paper addresses temporal misalignment in pretrained language models by introducing TAQA, a large-scale dataset of time-sensitive questions (2000–2023) built from Wikipedia tables. It formalizes temporal QA and develops three alignment strategies (time-aware prompting, target-year finetuning, and temporal-adaptive finetuning) that bias models toward a chosen time. Aligning to 2022 improves F1 against that year's answers by up to 62.2%, and aligning to a historical year is also possible, reaching up to 2.8x the unaligned model's performance for 2010. The experiments show that alignment benefits are larger for popular topics and are not solely due to memorization; they also analyze data-selection effects and model-size scaling. The work demonstrates that a pretrained model's internal temporal knowledge can be tuned after pretraining, with practical implications for keeping QA systems accurate and up to date. It also discusses limitations and ethical considerations for deploying temporally aware LMs in real-world applications.

Abstract

Language models (LMs) are trained on web text originating from many points in time and, in general, without any explicit temporal grounding. This work investigates the temporal chaos of pretrained LMs and explores various methods to align their internal knowledge to a target time, which we call "temporal alignment." To do this, we first automatically construct a dataset containing 20K time-sensitive questions and their answers for each year from 2000 to 2023. Based on this dataset, we empirically show that pretrained LMs (e.g., LLaMa2), despite having a recent pretraining cutoff (e.g., 2022), mostly answer questions using earlier knowledge (e.g., from 2019). We then develop several methods, from prompting to finetuning, to align LMs to use their most recent knowledge when answering questions, and investigate various factors in this alignment. Our experiments demonstrate that aligning LLaMa2 to the year 2022 can enhance its performance by up to 62% according to that year's answers. This improvement occurs even without explicitly mentioning time information, indicating the possibility of aligning models' internal sense of time after pretraining. Finally, we find that alignment to a historical time is also possible, with up to 2.8$\times$ the performance of the unaligned LM in 2010 when models are finetuned to that year. These findings hint at the sophistication of LMs' internal knowledge organization and the necessity of tuning them properly.
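
To make "performance according to that year's answers" concrete, the sketch below scores one model prediction against the gold answer for each year using a token-overlap F1 in the style commonly used for extractive QA. This excerpt does not spell out the paper's exact normalization or scoring code, so the function names (`normalize`, `token_f1`, `f1_by_year`), the normalization steps, and the toy answers are all assumptions made for illustration.

```python
from __future__ import annotations

import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed normalization (SQuAD-style); the paper's may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def f1_by_year(prediction: str, answers_by_year: dict[int, str]) -> dict[int, float]:
    """Score a single prediction against the gold answer of every year."""
    return {year: token_f1(prediction, ans) for year, ans in answers_by_year.items()}


# Hypothetical toy question whose answer changed between 2019 and 2022.
# The names are placeholders, not entries from TAQA.
answers = {2019: "Alice Smith", 2022: "Bob Jones"}
scores = f1_by_year("Alice Smith", answers)
```

Here `scores` is high for 2019 and zero for 2022, which is the pattern the abstract describes for an unaligned model: the prediction is correct for an earlier year but wrong for the target year. Averaging such per-year scores over many questions would give a by-year performance curve.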

Paper Structure (50 sections, 3 equations, 8 figures, 24 tables)


Figures (8)

  • Figure 1: Performance (F1 score) of various LMs on our TAQA dataset, by year. Unaligned LMs (left) and conventionally aligned models (upper right) show relatively stronger performance when measured by the answers in earlier years, with their predictions more scattered across time. Our temporal alignment methods (lower right) lead to improved performance closer to a recent time (here, 2022) with a higher peak. The dotted line between GPT-3 and ChatGPT implies an uncertain relation (the latter is not necessarily derived from the former).
  • Figure 2: The data construction process of our TAQA dataset.
  • Figure 3: The comparison of TAQA and existing time-sensitive QA datasets regarding the frequency of the answers' corresponding time. TAQA contains more data in the post-2000 era.
  • Figure 4: (Left) Relationship between question popularity (measured by the pageviews of the Wikipedia page each question originates from) and models' F1 score on those questions, as of 2022. (Right) Relationship between test questions' maximum semantic similarity to the training set and models' F1 score on those questions, as of 2022.
  • Figure 5: The temporal knowledge distribution of LMs finetuned with the NQ dataset and our TAQA dataset where the answers are randomly sampled.
  • ...and 3 more figures