Toward Understanding In-context vs. In-weight Learning

Bryan Chan; Xinyi Chen; András György; Dale Schuurmans

TL;DR

This work seeks to explain when in-context learning (ICL) emerges in transformers and why it can fade with more training data. It introduces a simple gating model that combines an in-weight predictor $g$ and an in-context predictor $h$ via a selector $\alpha(\tilde{x})$, and uses generalization-error and regret analyses to show how the data distribution and sample size drive the emergence or transience of ICL relative to in-weight learning (IWL). The authors validate the theory with synthetic classification and Omniglot experiments, showing that ICL can appear under certain conditions and eventually be overtaken by IWL as more in-weight data becomes available; they also demonstrate that memorization in a real LLM can override ICL. The results suggest that data learnability, not just distributional properties, governs ICL, with practical implications for training schedules and for manipulating the ICL/IWL tradeoff in large models.

Abstract

It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also diminish upon further training. We provide a new theoretical understanding of these phenomena by identifying simplified distributional properties that give rise to the emergence and eventual disappearance of in-context learning. We do so by first analyzing a simplified model that uses a gating mechanism to choose between an in-weight and an in-context predictor. Through a combination of a generalization error and regret analysis we identify conditions where in-context and in-weight learning emerge. These theoretical findings are then corroborated experimentally by comparing the behaviour of a full transformer on the simplified distributions to that of the stylized model, demonstrating aligned results. We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
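
The gating idea can be illustrated with a minimal sketch: a selector decides, per query, whether to trust an in-weight predictor or an in-context predictor. Everything below is a hypothetical illustration rather than the authors' construction. The lookup-table memory, the majority-vote context predictor, the hard 0/1 selector, and all function names are assumptions chosen only to make the mechanism concrete.

```python
from collections import Counter


def in_weight_predictor(query, memory):
    """Hypothetical g: return the label memorized for this query, if any."""
    return memory.get(query)


def in_context_predictor(query, context):
    """Hypothetical h: majority label among context examples matching the query."""
    labels = [label for example, label in context if example == query]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def selector(query, memory):
    """Hypothetical alpha: 1.0 if the query was seen in training, else 0.0."""
    return 1.0 if query in memory else 0.0


def gated_predict(query, context, memory):
    """Route the query to g when alpha is high, otherwise to h."""
    alpha = selector(query, memory)
    if alpha >= 0.5:
        return in_weight_predictor(query, memory)
    return in_context_predictor(query, context)


# "cat" was memorized with label 0; "dax" appears only in the context.
memory = {"cat": 0}
context = [("dax", 1), ("dax", 1), ("cat", 1)]

memorized_result = gated_predict("cat", context, memory)
novel_result = gated_predict("dax", context, memory)
```

In this toy setup the memorized query follows the in-weight label even though the context assigns it a different one, while the unseen query falls back on the context. This mirrors, in a very simplified way, the described behaviour in which in-weight memorization can override in-context information.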
