Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models

Amal Rannen-Triki, Jorg Bornschein, Razvan Pascanu, Marcus Hutter, Andras György, Alexandre Galashov, Yee Whye Teh, Michalis K. Titsias

TL;DR

This empirical study provides insights on when online adaptation is particularly interesting and highlights that with online adaptation the conceptual distinction between in-context learning and fine tuning blurs: both are methods to condition the model on previously observed tokens.

Abstract

We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency), sensitivity to the overall distributional drift, and the computational overhead for performing gradient computations and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and fine tuning blurs: both are methods to condition the model on previously observed tokens.
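
To make the idea of dynamic evaluation concrete, here is a minimal sketch of the predict-then-update loop on a toy model. It is not the paper's setup: the unigram softmax model, the plain SGD step, the learning rate, and the synthetic token stream are all assumptions chosen to keep the example self-contained. It scores each token before adapting on it, and reports the cumulative log-loss of dynamic evaluation relative to static evaluation (the regret).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def evaluate(tokens, vocab_size, lr):
    """Return the cumulative log-loss (in nats) after each token.

    lr = 0.0 is static evaluation (parameters frozen);
    lr > 0.0 is dynamic evaluation (one SGD step per observed token).
    """
    # Toy stand-in for a pretrained model: a unigram model with uniform logits.
    logits = [0.0] * vocab_size
    total, cumulative = 0.0, []
    for tok in tokens:
        probs = softmax(logits)
        # Score the token first, so the loss is always a true prediction loss.
        total += -math.log(probs[tok])
        cumulative.append(total)
        # Then adapt: the gradient of -log p[tok] w.r.t. the logits
        # is probs - onehot(tok).
        for i in range(vocab_size):
            grad = probs[i] - (1.0 if i == tok else 0.0)
            logits[i] -= lr * grad
    return cumulative

# Hypothetical stream with a distribution shift at the midpoint.
stream = [0] * 50 + [1] * 50
static = evaluate(stream, vocab_size=4, lr=0.0)
dynamic = evaluate(stream, vocab_size=4, lr=0.5)

# Regret: cumulative dynamic loss minus cumulative static loss.
# Negative values mean online adaptation is ahead.
regret = [d - s for d, s in zip(dynamic, static)]
print(f"static: {static[-1]:.1f} nats, dynamic: {dynamic[-1]:.1f} nats")
```

In this sketch the parameters act as a state that changes with every observed token, which is the sense in which the abstract describes weights as a form of memory. The extra cost is one gradient computation and one parameter update per token.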

Paper Structure (14 sections, 13 figures)

Figures (13)

  • Figure 1: Left: Cumulative log-loss for dynamic evaluation relative to static evaluation (regret). The starting point is always a model that has been finetuned on the PG-19 training set. Right: Detailed view of the regret for the first 5 books. Vertical green lines indicate the beginning of new books.
  • Figure 2: Regret plot of Transformer-XL style online learning with varying increment size, relative to overlapping online learning with 0.5 overlap. We observe that Transformer-XL style online learning generally leads to 20k to 70k nats less accumulated loss. However, 70k nats over 11.8M tokens corresponds to only about 0.006 nats/token, a minuscule improvement compared to the differences plotted in Figures \ref{fig:regret} to \ref{fig:lora_regret}.
  • Figure 3: Performance vs. compute (FLOPs) for static and dynamic evaluation. Models with 1 billion parameters, varying the context size and the number of finetuning samples (books). The Pareto front is constructed by varying the update frequency.
  • Figure 4: Performance vs. compute (FLOPs) for static and dynamic evaluation. Varying model and context sizes, and the number of finetuning samples. The models are updated with every new observation.
  • Figure 5: Left: Scaling of the average PG-19 test loss with the size of the PG-19 i.i.d. finetuning dataset (for the 400M model). Right: Average test NLL as a function of the model size (after finetuning on 10k books).
  • ...and 8 more figures