Time is Encoded in the Weights of Finetuned Language Models

Kai Nylund; Suchin Gururangan; Noah A. Smith

TL;DR

The paper addresses temporal misalignment in language models by introducing time vectors, which are computed as $\tau_t = \theta_t - \theta_{\text{pre}}$ to capture how finetuning on a single time period shifts weights. These vectors enable weight-space interpolation to handle intervening and future time periods, and they reveal that time is organized as a manifold in weight space, with closer times yielding more similar vectors. The authors demonstrate linear yearly degradation, seasonal monthly patterns, and a strong relationship between time-vector similarity and temporal degradation across tasks and model sizes. They further show that interpolating between time vectors improves performance on unseen times and that task analogies can update models to future times using unlabeled data, though multi-year model soups do not outperform training on all data. The work provides practical, scalable tools for temporally aware language modeling and contributes a new perspective on how time is represented in neural weight spaces.

Abstract

We present time vectors, a simple tool to customize language models to new time periods. Time vectors are created by finetuning a language model on data from a single time (e.g., a year or month), and then subtracting the weights of the original pretrained model. This vector specifies a direction in weight space that, as our experiments show, improves performance on text from that time period. Time vectors specialized to adjacent time periods appear to be positioned closer together in a manifold. Using this structure, we interpolate between time vectors to induce new models that perform better on intervening and future time periods, without any additional training. We demonstrate the consistency of our findings across different tasks, domains, model sizes, and time scales. Our results suggest that time is encoded in the weight space of finetuned models.
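The definition in the abstract (finetuned weights minus pretrained weights) can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `time_vector`, `add_vector`, and `cosine_similarity` are hypothetical, and the model weights are assumed to be plain dictionaries mapping parameter names to flat lists of floats rather than real T5 checkpoints.

```python
import math

def time_vector(theta_t, theta_pre):
    """tau_t = theta_t - theta_pre, computed parameter by parameter."""
    return {name: [ft - pre for ft, pre in zip(theta_t[name], theta_pre[name])]
            for name in theta_pre}

def add_vector(theta_pre, tau, scale=1.0):
    """Move the pretrained weights along a time vector: theta_pre + scale * tau."""
    return {name: [pre + scale * d for pre, d in zip(theta_pre[name], tau[name])]
            for name in theta_pre}

def cosine_similarity(tau_a, tau_b):
    """Cosine similarity between two time vectors, flattened over all parameters."""
    a = [x for name in sorted(tau_a) for x in tau_a[name]]
    b = [x for name in sorted(tau_b) for x in tau_b[name]]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy weights (illustrative values only, not taken from the paper).
theta_pre = {"w": [1.0, 2.0], "b": [0.5]}
theta_2015 = {"w": [1.5, 2.0], "b": [0.0]}
theta_2016 = {"w": [1.5, 2.25], "b": [0.0]}

tau_2015 = time_vector(theta_2015, theta_pre)
tau_2016 = time_vector(theta_2016, theta_pre)

# Adding a time vector back to the pretrained weights recovers the finetuned model.
restored_2015 = add_vector(theta_pre, tau_2015)

# Similarity between adjacent toy years; the paper relates this kind of
# similarity to temporal degradation.
similarity = cosine_similarity(tau_2015, tau_2016)
```

In practice the same arithmetic would run over the state dictionary of a real checkpoint, one tensor at a time.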

Paper Structure (37 sections, 1 equation, 16 figures, 5 tables)

Introduction
Data and Finetuning
Datasets
Language Modeling
Downstream Tasks
Finetuning
Revealing Temporal Misalignment at Multiple Time Scales
Yearly Degradation is Linear
Monthly Degradation is Seasonal
Summary
Temporal Adaptation with Time Vectors
Background and Definition
Correlation of Time Vector Similarity and Temporal Degradation
Generalizing to Intervening Time Periods
Method
...and 22 more sections

Figures (16)

Figure 1: We present time vectors, a simple tool to customize language models to new time periods. Time vectors ($\tau_i$) specify a direction in weight space that improves performance on text from a time period $i$. They are computed by subtracting the pretrained weights ($\theta_{\text{pre}}$; left panel) from those finetuned to a target time period ($\theta_i$). We can customize model behavior to new time periods (e.g., intervening months or years) by interpolating between time vectors and adding the result to the pretrained model (middle panel). We can also generalize to a future time period $j$ with analogy arithmetic (right panel). This involves combining a task-specific time vector with analogous time vectors derived from finetuned language models ($\tau^{\text{LM}}_j$).
Figure 2: Model performance degrades linearly year-to-year. We evaluate language model perplexity (WMT), ROUGE-L (news summarization), and macro F1 (political affiliation classification). Each cell indicates the monthly performance of T5-3B finetuned and evaluated on a single year from that task. We report the percentage difference from the average performance for each year, and find linear degradation as finetuning and evaluation years become more misaligned regardless of task. We display similar trends for T5-small and medium, as well as for other domains and tasks, in §\ref{subsec:other_yearly_misalignment}. We measure the linearity of these degradations in Appendix Table \ref{table:td_scores}.
Figure 3: Monthly temporal degradation has seasonal patterns. Each cell indicates the monthly performance of T5-small finetuned and evaluated on a single month of the WMT dataset. We report the percentage difference in test perplexity from the average on the evaluation month over all finetuned T5-small models (darker is better). The diagonal indicates that each model does best on its finetuning month. Models also do relatively better on the same month in other years, visible as the stripes radiating out from the diagonal every 12 months.
Figure 4: Time vectors are organized in a manifold that reflects temporal variation. Each point is a UMAP projection of the last feedforward layer of a T5-small time vector finetuned on a single month of WMT. Points and edges between adjacent months are colored by year. Distances between the weights of time vectors correlate with temporal misalignment (§\ref{subsec:cos_sim_static}).
Figure 5: Interpolating between two year vectors improves performance on the years between them. These performance improvements follow an intuitive structure, e.g. when interpolating between 2012 and 2016, the best result on 2013 occurs with a higher percentage of 2012 and vice versa for 2015. Improvement from interpolation varies across settings.
...and 11 more figures
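
The interpolation and analogy arithmetic described in the captions for Figures 1 and 5 can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names (`combine`, `interpolate`, `analogy`) are hypothetical, time vectors are assumed to be dictionaries of flat float lists, and the scaling weights `a1`, `a2`, `a3` in the analogy are assumed hyperparameters that the captions do not specify.

```python
def combine(vectors, weights):
    """Weighted sum of time vectors, parameter by parameter."""
    first = vectors[0]
    return {name: [sum(w * v[name][i] for v, w in zip(vectors, weights))
                   for i in range(len(first[name]))]
            for name in first}

def interpolate(tau_a, tau_b, alpha):
    """alpha * tau_a + (1 - alpha) * tau_b, with alpha in [0, 1]."""
    return combine([tau_a, tau_b], [alpha, 1.0 - alpha])

def analogy(tau_task_i, tau_lm_i, tau_lm_j, a1=1.0, a2=1.0, a3=1.0):
    """Estimate a task vector for time j from a task vector at time i and
    language-model time vectors at i and j:
    a1 * tau_task_i + (a2 * tau_lm_j - a3 * tau_lm_i).
    The scaling weights are assumptions made for this sketch."""
    return combine([tau_task_i, tau_lm_j, tau_lm_i], [a1, a2, -a3])

# Toy time vectors (illustrative values only).
tau_2012 = {"w": [1.0, 0.0]}
tau_2016 = {"w": [0.0, 1.0]}

# Following the Figure 5 caption, a year close to 2012 (e.g. 2013) would
# use a higher share of the 2012 vector.
tau_2013_estimate = interpolate(tau_2012, tau_2016, alpha=0.75)

# Analogy: shift a task vector from time i toward time j using the
# difference between language-model time vectors.
tau_task_i = {"w": [1.0, 0.0]}
tau_lm_i = {"w": [0.5, 0.5]}
tau_lm_j = {"w": [0.5, 1.0]}
tau_task_j_estimate = analogy(tau_task_i, tau_lm_i, tau_lm_j)
```

Either result would then be added to the pretrained weights to obtain a model for the target period, with no further training.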
