Transformers generalize differently from information stored in context vs in weights

Stephanie C. Y. Chan; Ishita Dasgupta; Junkyung Kim; Dharshan Kumaran; Andrew K. Lampinen; Felix Hill

TL;DR

This paper examines how transformers generalize from information stored in weights versus information provided in context. In synthetic settings, generalization from weights shows a strong rule-based bias, while generalization from context shows an exemplar-based bias. In pretrained language models, however, in-context generalization is partially rule-based, and more so as model size increases. The authors demonstrate that pretraining on data requiring rule-based generalization from context can shift in-context biases toward rule-based strategies. These findings suggest that natural language structure and model scale jointly influence inductive biases, with practical implications for deciding where to encode information (weights vs context).

Abstract

Transformer models can use two fundamentally different kinds of information: information stored in weights during training, and information provided "in-context" at inference time. In this work, we show that transformers exhibit different inductive biases in how they represent and generalize from the information in these two sources. In particular, we characterize whether they generalize via parsimonious rules (rule-based generalization) or via direct comparison with observed examples (exemplar-based generalization). This has important practical consequences, as it informs whether to encode information in weights or in context, depending on how we want models to use that information. In transformers trained on controlled stimuli, we find that generalization from weights is more rule-based whereas generalization from context is largely exemplar-based. In contrast, we find that in transformers pre-trained on natural language, in-context learning is significantly rule-based, with larger models showing more rule-basedness. We hypothesise that rule-based generalization from in-context information might be an emergent consequence of large-scale training on language, which has sparse rule-like structure. Using controlled stimuli, we verify that transformers pretrained on data containing sparse rule-like structure exhibit more rule-based generalization.
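The distinction between rule-based and exemplar-based generalization can be made concrete with a toy version of the partial exposure test (combinations AX, AW, and BW seen; BX held out). The sketch below is illustrative only: the symbolic stimuli, the "first single feature that explains all labels" rule search, and the shared-feature-count similarity are assumptions chosen for clarity, not the procedure used in the paper.

```python
from collections import Counter

# Partial-exposure data: (feature 1, feature 2) -> label.
# Labels follow the Figure 1 description: A-stimuli are "*", the B-stimulus is "o".
TRAIN = [(("A", "X"), "*"), (("A", "W"), "*"), (("B", "W"), "o")]
TEST = ("B", "X")  # held-out combination


def rule_based(train, query):
    """Classify with the first single feature that explains every training label.

    Hypothetical stand-in for a 'parsimonious decision boundary'.
    """
    for i in range(len(query)):
        mapping = {}
        consistent = True
        for stimulus, label in train:
            # setdefault records the first label seen for this feature value.
            if mapping.setdefault(stimulus[i], label) != label:
                consistent = False
                break
        if consistent and query[i] in mapping:
            return {mapping[query[i]]: 1.0}
    return {}  # no single feature explains the data


def exemplar_based(train, query):
    """Weight each training label by how many features the example shares with the query.

    The shared-feature count is an assumed similarity measure.
    """
    votes = Counter()
    for stimulus, label in train:
        votes[label] += sum(a == b for a, b in zip(stimulus, query))
    total = sum(votes.values())
    return {label: count / total for label, count in votes.items()}


if __name__ == "__main__":
    print("rule-based:    ", rule_based(TRAIN, TEST))
    print("exemplar-based:", exemplar_based(TRAIN, TEST))
```

Under these assumptions, the rule-based classifier assigns BX entirely to the label associated with B, whereas the exemplar-based classifier splits evenly between the two labels because BX shares exactly one feature with AX and one with BW.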

Paper Structure (18 sections, 6 figures)

Introduction
Experimental Design
Results
Trained-from-scratch transformers
Generalization from in-weights information is rule-based.
Generalization from in-context information is exemplar-based.
Pretrained language models
In language models, generalization from in-context information is partially rule-based.
Smaller models are less rule-based.
In-context generalization can be made more rule-based with pre-training.
Conclusions
Experiment details: Trained-from-scratch transformers
Subvector stimuli
Pretraining for few-shot learning
Evaluating inductive biases
...and 3 more sections

Figures (6)

Figure 1: Partial exposure test for differentiating rule-based vs exemplar-based generalization. Stimuli have two features. The model sees three combinations (AX, AW, and BW) in training or in context (depending on experiment), and is evaluated on a held-out (test) combination BX. (a) A rule-based model uses a parsimonious decision boundary that explains the data (here, based only on Feature 1), classifying the test as o. (b) An exemplar-based model computes the similarity between test and training examples using all features. Since BX is equally similar to AX and BW, it is equally likely to classify it as * or o.
Figure 2: Generalization patterns of transformer models trained on synthetic data: frequency of various model outputs when presented with the held-out stimulus of the partial exposure paradigm (Fig. 1). (a) Generalization from weights is completely rule-based. (b) In contrast, generalization from context is exemplar-based. (c) The exemplar-based bias in in-context learning can be overcome by pretraining the model on sequences that explicitly require rule-based generalization.
Figure 3: Generalization from in-context information in a pretrained LM. We classify each LM response by whether it gives the label consistent with generalizing along color, shape, or neither. (a) Measuring feature-level bias with the Control condition; the model prefers to generalize along color. We use these results as baselines for the partial exposure conditions. (b) When a sparse rule-based decision boundary supports shape as predictive, the model classifies along shape more often than in the baseline control (dotted line). (c) Similarly, when color is predictive, the model classifies along color more often than in the baseline control (dotted line). (d) Smaller LMs are less rule-based.
Figure 4: (a) Sequences of alternating stimuli and labels are passed to a transformer. Each sequence consists of a "context" (12 stimulus-label pairs) and a "query" stimulus. The model is trained to minimize the loss on the query prediction. (b) To evaluate generalization from context, the model is first pretrained to perform in-context learning by training on few-shot sequences; stimulus classes and labels are randomly chosen for every sequence, so that the model must perform few-shot learning from context. Inductive biases are evaluated on "partial exposure" sequences, where one combination is held out for evaluation ("BX"). Consistent selection of the label associated with "B" indicates a rule-based bias, while equal selection of the labels associated with "A" and "B" indicates an exemplar-based bias (since "BX" is equally similar to "AX" and "BW"). (c) Each stimulus consists of two subvectors concatenated together into a single token. Each subvector belongs to a particular class, and each class is characterized by a different centroid. The subvectors are sampled from a multivariate normal centered on that centroid. The subvectors have length 32, but here we only show 4 values. (d) To evaluate generalization from weights, the model is instead trained directly on partial exposure data, and inductive biases are evaluated on the held-out combination. (The context consisted of random samplings of the stimulus classes, and was irrelevant to the query prediction.)
Figure 5: To evaluate generalization from context in a pretrained language model, the model is evaluated on partial exposure sequences where the features are instead text features (shape and color words). The control condition allows us to evaluate the model's baseline bias towards shape or color.
...and 1 more figures
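The stimulus construction described in the Figure 4 caption (two length-32 subvectors, each sampled around a class centroid and concatenated into one token, arranged into a 12-pair context plus a query) can be sketched as follows. This is a minimal sketch using only the Python standard library. The subvector length and context size come from the caption; the centroid distribution, the isotropic noise with `noise_std=0.1`, the integer labels, and the equal split of context pairs across the three seen combinations are all assumptions made for illustration, and the function names are hypothetical.

```python
import random

SUBVECTOR_LEN = 32  # stated in the Figure 4 caption
CONTEXT_PAIRS = 12  # stated in the Figure 4 caption


def make_centroids(class_names, rng, dim=SUBVECTOR_LEN):
    """One centroid per class; drawing centroids from N(0, 1) is an assumption."""
    return {name: [rng.gauss(0.0, 1.0) for _ in range(dim)] for name in class_names}


def sample_subvector(centroid, rng, noise_std=0.1):
    """Sample around a centroid with isotropic Gaussian noise (assumed covariance)."""
    return [c + rng.gauss(0.0, noise_std) for c in centroid]


def make_stimulus(class1, class2, centroids, rng):
    """Concatenate two subvectors into a single token of length 2 * SUBVECTOR_LEN."""
    return (sample_subvector(centroids[class1], rng)
            + sample_subvector(centroids[class2], rng))


def partial_exposure_sequence(centroids, rng, n_pairs=CONTEXT_PAIRS):
    """Build a context from AX, AW, BW and a held-out BX query.

    Labels 0 (for A) and 1 (for B) are placeholders chosen for this sketch.
    """
    seen = [(("A", "X"), 0), (("A", "W"), 0), (("B", "W"), 1)]
    context = []
    for i in range(n_pairs):
        (c1, c2), label = seen[i % len(seen)]  # assumed equal split across combinations
        context.append((make_stimulus(c1, c2, centroids, rng), label))
    rng.shuffle(context)
    query = make_stimulus("B", "X", centroids, rng)
    return context, query


if __name__ == "__main__":
    rng = random.Random(0)
    centroids = make_centroids(["A", "B", "X", "W"], rng)
    context, query = partial_exposure_sequence(centroids, rng)
    print(len(context), "context pairs; token length", len(query))
```

A model that predicts label 1 for the query consistently would be behaving in the rule-based way the caption describes, while an even split between labels 0 and 1 would indicate exemplar-based behavior.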
