The Transient Nature of Emergent In-Context Learning in Transformers

Aaditya K. Singh, Stephanie C. Y. Chan, Ted Moskovitz, Erin Grant, Andrew M. Saxe, Felix Hill

TL;DR

This work challenges the prevailing assumption that in-context learning (ICL) in transformers, once it emerges, persists under continued training. Using controlled synthetic data that can be solved by either ICL or in-weights learning (IWL), the authors show that ICL typically rises and then fades as training proceeds, even as the training loss keeps decreasing and IWL improves. The transience holds across model sizes and dataset sizes, and even when the inputs are language-model token embeddings, though factors such as a large number of classes and Zipfian data distributions can delay the decay. Regularization, especially L2, can stabilize ICL or even eliminate the transience, and competition between ICL and IWL circuits in the residual stream offers a potential mechanistic explanation. These findings have practical implications for training compact transformers and underscore the need for validation strategies that preserve flexible, context-dependent learning.

Abstract

Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it. Prior work has provided a deeper understanding of how ICL emerges in transformers, e.g. through the lens of mechanistic interpretability, Bayesian inference, or by examining the distributional properties of training data. However, in each of these cases, ICL is treated largely as a persistent phenomenon; namely, once ICL emerges, it is assumed to persist asymptotically. Here, we show that the emergence of ICL during transformer training is, in fact, often transient. We train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. The transient nature of ICL is observed in transformers across a range of model sizes and datasets, raising the question of how much to "overtrain" transformers when seeking compact, cheaper-to-run models. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks. Finally, we present initial evidence that ICL transience may be caused by competition between ICL and IWL circuits.
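The early stopping mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `pick_checkpoint`, the `patience` parameter, and the accuracy values are all hypothetical. It assumes ICL accuracy is measured on a held-out ICL-style validation task at regular checkpoints, and keeps the checkpoint at the ICL peak before the decay sets in.

```python
def pick_checkpoint(icl_accuracy, patience=3):
    """Return the index of the checkpoint with the best ICL validation accuracy,
    stopping once `patience` consecutive evaluations fail to improve on it.

    `icl_accuracy` is a list of ICL-evaluator accuracies, one per checkpoint.
    Both the name and the patience rule are illustrative assumptions.
    """
    best_idx, best_acc, stale = 0, float("-inf"), 0
    for i, acc in enumerate(icl_accuracy):
        if acc > best_acc:
            best_idx, best_acc, stale = i, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break  # ICL has stopped improving; later checkpoints are ignored
    return best_idx


# Made-up curve with the rise-then-fade shape described in the abstract.
curve = [0.50, 0.70, 0.90, 0.85, 0.60, 0.40, 0.30]
print(pick_checkpoint(curve))  # index of the peak checkpoint
```

Note that the training loss cannot serve as the stopping signal here, since it keeps decreasing while ICL fades; the criterion has to be an ICL-specific validation metric.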

Paper Structure (21 sections, 16 figures)

Figures (16)

  • Figure 1: In-context learning is transient, shown for our "default" settings: 12 layers, embedding dimension of 64, trained on 1,600 classes, with 20 exemplars per class. All training sequences are bursty (see Figure 2 for details). Chan et al. (2022) found these settings to strongly incentivize ICL, but did not observe ICL transience (see Figure 3), as they did not train long enough. (a) ICL evaluator accuracy. (b) IWL evaluator accuracy. We note that, while accuracy on train sequences is 100%, accuracy on the IWL evaluator is very slowly increasing, as the test sequences are out-of-distribution. See the appendix for further investigation. (c) Training log loss. Two colors indicate two seeds used for experiments.
  • Figure 2: An overview of our setup. (a) Example sequences during training, ICL evaluation, and IWL evaluation. Example outputs are colored green when correct and red when incorrect. Note that for train sequences, ICL and IWL strategies both result in the correct answer. On ICL eval sequences, the IWL prediction is incorrect: ICL is required to remap exemplars to the randomized 0 or 1 labels. On IWL eval sequences, there are no matching exemplar-label pairs in context, so IWL is necessary. (b) Model schematic. Training and evaluation focus on the predicted label for the final exemplar.
  • Figure 3: (Reproduced with permission from Chan et al., 2022.) It was previously shown that ICL can be transient (purple and red curves) when transformers are trained on data that is only weakly conducive to ICL (e.g., with low levels of burstiness). P(bursty) indicates the fraction of training sequences that are bursty. All curves in this figure train on a dataset with 1,600 classes and 20 exemplars. Note the x-axis scale: at this number of iterations, there is no sign of ICL transience for the highest level of burstiness, but our experiments are run for a much larger number of steps. We also note that, when P(bursty) < 1, some of the sequences during training are of the same form as the IWL evaluation sequences. As a result, the IWL evaluator is less out-of-distribution, which is why the accuracies in the IWL panel, when P(bursty) < 1, are so high.
  • Figure 4: (ICL and IWL evaluator panels) ICL is transient regardless of model depth, with no clear trend of peak height or peak onset. Decay slopes are roughly similar across model sizes.
  • Figure 5: In-class variation improves ICL, as previously known and seen here by the higher peak heights, but ICL is nonetheless transient across settings.
  • ...and 11 more figures
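
The three sequence types described in the Figure 2 caption can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' data pipeline: exemplars are stood in for by their class id, each class's fixed training label is taken to be its class id, and `CONTEXT_PAIRS`, `BURST_SIZE`, and all function names are hypothetical. Only the number of classes (1,600) comes from the captions.

```python
import random

NUM_CLASSES = 1600   # from the Figure 1 caption
CONTEXT_PAIRS = 8    # assumption: (exemplar, label) pairs shown before the query
BURST_SIZE = 4       # assumption: repeats of the query class in a bursty context


def make_train_sequence(rng):
    """Bursty sequence: the query class appears in context with its fixed label,
    so both ICL (copy from context) and IWL (recall from weights) are correct."""
    query = rng.randrange(NUM_CLASSES)
    pool = [c for c in range(NUM_CLASSES) if c != query]
    classes = [query] * BURST_SIZE + rng.sample(pool, CONTEXT_PAIRS - BURST_SIZE)
    rng.shuffle(classes)
    context = [(c, c) for c in classes]  # fixed label == class id (assumption)
    return context, query, query         # context, query class, target label


def make_icl_eval_sequence(rng):
    """Two classes are remapped to randomized labels 0 and 1, so only the
    context reveals the target. Classes 0 and 1 are excluded so that the
    fixed (IWL) label of the query can never equal the remapped label."""
    a, b = rng.sample(range(2, NUM_CLASSES), 2)
    remap = dict(zip((a, b), rng.sample([0, 1], 2)))
    classes = [a, b] * (CONTEXT_PAIRS // 2)
    rng.shuffle(classes)
    context = [(c, remap[c]) for c in classes]
    query = rng.choice((a, b))
    return context, query, remap[query]


def make_iwl_eval_sequence(rng):
    """No context pair matches the query class, so the fixed label must be
    recalled from the weights."""
    query = rng.randrange(NUM_CLASSES)
    pool = [c for c in range(NUM_CLASSES) if c != query]
    context = [(c, c) for c in rng.sample(pool, CONTEXT_PAIRS)]
    return context, query, query


rng = random.Random(0)
for make in (make_train_sequence, make_icl_eval_sequence, make_iwl_eval_sequence):
    context, query, target = make(rng)
    print(make.__name__, "query:", query, "target:", target)
```

In each case the model would be scored only on its prediction for the final (query) exemplar, matching the caption's description of training and evaluation.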