Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting

Suraj Anand; Michael A. Lepori; Jack Merullo; Ellie Pavlick

TL;DR

The paper investigates how language models can flexibly deploy in-context learning (ICL) and in-weights learning (IWL) by introducing structural ICL, a form of ICL invariant to token embeddings. It shows that structural ICL is transient during pretraining in both naturalistic and synthetic settings, as models encode information into weights, but that forgetting-based interventions—active forgetting, temporary forgetting, and probabilistic temporary forgetting—can sustain or induce dual-process behavior where ICL handles unseen or tail tokens while IWL covers frequent tokens. Active forgetting preserves structural ICL at the cost of IWL, whereas temporary forgetting enables a controllable balance, producing robust dual processing across distributions; probabilistic temporary forgetting extends this capability to pretrained models like GPT-2. The findings offer a practical pathway to robust generalization and token-type specialization, with implications for curriculum design, long-horizon pretraining, and efficient fine-tuning in skewed textual domains, enabling models to both memorize common patterns and generalize to rare or novel tokens.

Abstract

Language models have the ability to perform in-context learning (ICL), allowing them to flexibly adapt their behavior based on context. This contrasts with in-weights learning (IWL), where memorized information is encoded in model parameters after iterated observations of data. An ideal model should be able to flexibly deploy both of these abilities. Despite their apparent ability to learn in-context, language models are known to struggle when faced with unseen or rarely seen tokens (Land & Bartolo, 2024). Hence, we study $\textbf{structural in-context learning}$, which we define as the ability of a model to execute in-context learning on arbitrary novel tokens -- so called because the model must generalize on the basis of e.g. sentence structure or task structure, rather than content encoded in token embeddings. We study structural in-context algorithms on both synthetic and naturalistic tasks using toy models, masked language models, and autoregressive language models. We find that structural ICL appears before quickly disappearing early in LM pretraining. While it has been shown that ICL can diminish during training (Singh et al., 2023), we find that prior work does not account for structural ICL. Building on Chen et al. (2024)'s active forgetting method, we introduce pretraining and finetuning methods that can modulate the preference for structural ICL and IWL. Importantly, this allows us to induce a $\textit{dual process strategy}$ where in-context and in-weights solutions coexist within a single model.
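
The forgetting schedules named above can be illustrated with a minimal sketch. It assumes that active forgetting re-initializes the token embedding table every k optimizer steps, and that temporary forgetting does the same only during the first N steps, so N = 0 reduces to vanilla training. The function names, the Gaussian re-initialization, and the `scale` value are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Optional


def should_reset(step: int, k: int, n_forget_steps: Optional[int] = None) -> bool:
    """Return True if the embedding table is re-initialized at this step.

    k: assumed reset period (one reset every k optimizer steps).
    n_forget_steps: None means active forgetting for the whole run;
        an integer N means temporary forgetting, where resets stop after step N.
    """
    if step == 0 or step % k != 0:
        return False
    return n_forget_steps is None or step <= n_forget_steps


def reset_embeddings(vocab_size: int, dim: int, rng: random.Random,
                     scale: float = 0.02) -> List[List[float]]:
    """Draw a fresh embedding table (hypothetical Gaussian initialization)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(vocab_size)]


def reset_steps(total_steps: int, k: int,
                n_forget_steps: Optional[int] = None) -> List[int]:
    """List every step in [0, total_steps] at which a reset would occur."""
    return [s for s in range(total_steps + 1) if should_reset(s, k, n_forget_steps)]
```

In a training loop, `should_reset` would gate a call to `reset_embeddings` while all other weights keep training. Under these assumptions, raising N extends the period in which the model cannot rely on stable token embeddings.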

Paper Structure (49 sections, 21 figures)

Introduction
Definitions
In-Context vs. In-Weights Learning
Structural vs. Conditional ICL
Head vs. Tail
(Structural) In-Context Learning is Transient
Task
Training Dynamics
Structural ICL
In-Context vs. In-Weights Strategies
Data Distribution Impacts In-Context Learning
Training Dynamics
Transience of Structural ICL
In-Context Learning conflicts with In-Weights Learning
Maintaining Structural ICL with Active Forgetting
...and 34 more sections

Figures (21)

Figure 1: (Top Left) In our naturalistic setting, we train a part-of-speech probe on BERT representations of sentences from Penn Treebank 3 and evaluate it on templatic examples (Section \ref{sec:natural_setting}). (Top Right) In our synthetic setting, we train a small masked language model (MLM) on sequences where the expected response is determined based on the part-of-speech of the query token (Section \ref{sec:synthetic_setting}). (Bottom Left) An idealization of two main findings: (1) structural ICL is transient (i.e. decays over training) in both naturalistic and synthetic settings, and (2) active/temporary forgetting maintains structural ICL in the synthetic setting. (Bottom Right) Temporary forgetting induces structural ICL when applied for $N>0$ steps, enabling generalization to unseen random tokens. In-weights preference is coarsely controllable by varying temporary forgetting parameter $N$.
Figure 2: (Left) Structural ICL is transient, as Random Token accuracy first peaks and then decays. (Middle) We investigate the benefit of contextualization over memorization in Head and Tail datasets by examining the difference in Layer 7 Accuracy (where both in-context and in-weights strategies are possible) and Layer 0 Accuracy (where only an in-weights strategy is possible). These differences become negligible after sufficient training. (Right) Using the Head Switch and Tail Switch datasets, we find that models begin to encode POS using an IWL strategy over time. Note that the x-axis begins at training step 20,000 for (Middle) and (Right).
Figure 3: Comparative analysis of in-context learning performance across training methodologies and data distributions. (Top) In-context performance by distribution with vanilla training; (Bottom) In-context performance by distribution with active forgetting. The parameters used are $v=10000,\varepsilon=0.10$. Note that the Uniform distribution does not have a head or a tail, and we present results in the head graphs. (Top Left) Vanilla training results in structural ICL transience across all distributions. (Top Middle, Top Right) Conditional ICL is asymptotically nonzero for most distributions, unless they are highly skewed (i.e., $\alpha=1.5$). (Top Middle) However, IWL is often preferred for head tokens and (Top Right) conditional ICL is preferred for tail tokens. (Bottom Row) In contrast, active forgetting preserves structural ICL and removes all preference for IWL across distributions and datasets. Note: The y-axis in the bottom left is relabelled "Unseen Token Accuracy" to emphasize that the random token evaluation dataset does not contain any random embeddings seen during active forgetting.
Figure 4: (Left) Temporary forgetting achieves near-perfect unseen random token performance across distributions, indicating structural ICL. (Left, Green) Vanilla training on skewed distributions renders tail token performance poor; (Left, Blue) In contrast, tail token performance is almost perfect after temporary forgetting. (Right) Temporary forgetting can maintain a preference for IWL for the head of the distribution while maintaining a preference for ICL for the tail of the distribution, i.e., temporary forgetting induces dual process learning. Parameters used are $v=10000,\varepsilon=0.10$ and optimal hyperparameters $k,N$ are found using a grid search.
Figure 5: Performance by token decile and on randomly initialized embeddings (Rnd). (Left) With vanilla training on a skewed distribution (Zipfian $\alpha=1.5$), low decile tokens show poor performance. However, overall performance remains good because these tokens are rare. (Right) Temporary forgetting induces structural ICL to recover performance on tail, undertrained, and unseen tokens compared with Singh et al. (2023)'s $L_2$-regularization procedure, which was proposed to preserve conditional ICL.
...and 16 more figures
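
To make the head/tail skew in the captions above concrete, here is a minimal sketch of how probability mass is spread over token deciles under a Zipfian distribution. It assumes token probability is proportional to rank^(-alpha) over a vocabulary of size v, with v = 10000 and alpha = 1.5 taken from the captions. The decile split by frequency rank and the function names are illustrative assumptions, not the paper's evaluation code.

```python
from typing import List


def zipf_probs(vocab_size: int, alpha: float) -> List[float]:
    """Token probabilities proportional to rank ** -alpha (rank 1 is most frequent)."""
    weights = [rank ** -alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def decile_mass(probs: List[float]) -> List[float]:
    """Total probability in each tenth of the vocabulary, ordered by rank."""
    n = len(probs)
    bounds = [i * n // 10 for i in range(11)]
    return [sum(probs[bounds[i]:bounds[i + 1]]) for i in range(10)]


if __name__ == "__main__":
    masses = decile_mass(zipf_probs(10000, 1.5))
    for i, mass in enumerate(masses, start=1):
        # Decile 1 holds the most frequent tokens (head), decile 10 the rarest (tail).
        print(f"decile {i:2d}: {mass:.4f}")
```

With alpha = 1.5 the first decile carries most of the probability mass, which is consistent with the caption's remark that overall performance can stay high even when rare tokens are handled poorly.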
