Toward Understanding In-context vs. In-weight Learning

Bryan Chan, Xinyi Chen, András György, Dale Schuurmans

TL;DR

This work seeks to explain when in-context learning (ICL) emerges in transformers and why it can fade as the amount of training data grows. It introduces a simple gating model that combines an in-weight predictor $g$ and an in-context predictor $h$ via a selector $\alpha(\tilde{x})$, and proves, through generalization and regret analyses, how the data distribution and sample size drive the emergence or transience of ICL relative to in-weight learning (IWL). The authors validate the theory with synthetic classification and Omniglot experiments, showing that ICL can appear under certain conditions and eventually be overtaken by IWL as more in-weight data becomes available; they also demonstrate that memorization in a real LLM can override ICL. The results argue that data learnability, not just distributional properties, governs ICL, with practical implications for training schedules and for manipulating the ICL/IWL tradeoff in large models.

Abstract

It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also diminish upon further training. We provide a new theoretical understanding of these phenomena by identifying simplified distributional properties that give rise to the emergence and eventual disappearance of in-context learning. We do so by first analyzing a simplified model that uses a gating mechanism to choose between an in-weight and an in-context predictor. Through a combination of a generalization error and regret analysis we identify conditions where in-context and in-weight learning emerge. These theoretical findings are then corroborated experimentally by comparing the behaviour of a full transformer on the simplified distributions to that of the stylized model, demonstrating aligned results. We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
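The gating idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax-attention form of the in-context predictor, the linear form of the in-weight predictor, the convention that the selector weight `alpha` multiplies the in-weight output, and all function and parameter names are assumptions made for illustration only.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def in_weight_predictor(query, W):
    """Hypothetical g: depends only on the query and learned weights W
    (shape: num_classes x dim); the context is ignored."""
    return softmax(W @ query)


def in_context_predictor(query, context_x, context_y, temperature=1.0):
    """Hypothetical h: attention-weighted average of the one-hot context
    labels, so its output is a convex combination of those labels."""
    scores = context_x @ query / temperature
    attn = softmax(scores)
    return attn @ context_y


def gated_prediction(alpha, g_out, h_out):
    """Assumed convention: alpha in [0, 1] weights the in-weight output,
    and (1 - alpha) weights the in-context output."""
    return alpha * g_out + (1.0 - alpha) * h_out


if __name__ == "__main__":
    query = np.array([1.0, 0.0])
    context_x = np.array([[1.0, 0.0], [0.0, 1.0]])  # two context inputs
    context_y = np.eye(3)[[2, 0]]                   # their one-hot labels
    W = np.zeros((3, 2))                            # untrained in-weight model

    g_out = in_weight_predictor(query, W)
    h_out = in_context_predictor(query, context_x, context_y)
    print(gated_prediction(0.5, g_out, h_out))
```

With `alpha` near 1 the prediction relies on the weights (IWL-like behaviour); with `alpha` near 0 it copies labels from similar context examples (ICL-like behaviour). In this sketch `alpha` is passed in as a fixed number, whereas the selector in the text depends on the input sequence $\tilde{x}$.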

Paper Structure

This paper contains 37 sections, 6 theorems, 25 equations, 21 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Given an example sequence $\tilde{x}$ with one-hot label $y$, let $k$ denote the number of irrelevant labels in the context. Suppose that $\|x\| \le 1$ for all $x\in {\mathcal{X}}$. Then the prediction of any $h\in {\mathcal{H}}$ satisfies

Figures (21)

  • Figure 1: The theoretical error bounds of IC and IW predictors. We set $L = 8$, ${\epsilon} = 0.001$, $B = 1$, $C = 10$, and $y^*(x) = [0.999, 0.001/9, \dots, 0.001/9]$. As the number of irrelevant contexts, $k$, increases, the lower and upper bounds of the IC error also increase, whereas as the number of samples, $N_x$, increases, the lower and upper bounds of the IW error decrease. Consequently, one can expect ICL to be transient as we observe more samples.
  • Figure 2: 0-1 validation errors of the IC predictor, IW predictor, and transformer as a function of training set size $N$ on the synthetic data, over five seeds. $L = 1$, $p_{high} = 0.9$, $p_{relevant} = 0.9$ and $\sigma=0.2$. The columns correspond to test data with relevant/irrelevant context, and classes from the high-frequency (denoted $C_H$) or low-frequency classes (denoted $C_L$). The top row shows IBD error on the specified conditional data distribution, while the bottom row shows OOBD error. ICL diminishes as $N$ increases and IWL and ICL can emerge simultaneously.
  • Figure 3: 0-1 validation errors of the IC predictor, IW predictor, and transformer as a function of training set size $N$ on the synthetic data, over five seeds.
  • Figure 4: 0-1 validation errors, conditioned on relevant contexts and low-frequency classes $C_L$, of the IC predictor, IW predictor, and transformer as a function of training set size $N$ on the synthetic data, over five seeds. $L = 4$, $p_{high} = 0.9$, $p_{relevant} = 0.9$, and $L_{relevant} = \{1, 2, 3, 4\}$. The transformer exhibits stronger ICL as the number of relevant contexts $L_{relevant}$ increases.
  • Figure 5: 0-1 validation errors as a function of the dataset size $N$ on Omniglot, over three seeds. $L = 2$, $p_{high} = 0.9$, $p_{relevant} = 0.9$, and $\sigma = 0.0$. The transformer exhibits ICL on low-frequency classes $C_L$ initially but loses said capability as $N$ increases.
  • ...and 16 more figures

Theorems & Definitions (11)

  • Proposition 1
  • Corollary 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proof of Proposition \ref{prop:iwl_generalization}
  • Proof of Proposition \ref{prop:l1_norm_bound}
  • Proof of Corollary \ref{cor:ce_lower_bound}
  • Proof of Proposition \ref{prop:regret_two_level}
  • Proposition 5
  • ...and 1 more