Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting
Suraj Anand, Michael A. Lepori, Jack Merullo, Ellie Pavlick
TL;DR
The paper investigates how language models can flexibly deploy in-context learning (ICL) and in-weights learning (IWL) by introducing structural ICL, a form of ICL that is invariant to token embeddings. It shows that structural ICL is transient during pretraining in both naturalistic and synthetic settings, as models encode information into their weights. Three forgetting-based interventions (active forgetting, temporary forgetting, and probabilistic temporary forgetting) can sustain or induce dual-process behavior, in which ICL handles unseen or tail tokens while IWL covers frequent tokens. Active forgetting preserves structural ICL at the cost of IWL, whereas temporary forgetting allows the balance between the two to be controlled, producing robust dual processing across distributions; probabilistic temporary forgetting extends this capability to pretrained models such as GPT-2. The findings suggest a practical route to robust generalization and token-type specialization, with possible implications for curriculum design, long-horizon pretraining, and efficient fine-tuning in skewed textual domains, so that models can both memorize common patterns and generalize to rare or novel tokens.
Abstract
Language models have the ability to perform in-context learning (ICL), allowing them to flexibly adapt their behavior based on context. This contrasts with in-weights learning (IWL), where memorized information is encoded in model parameters after iterated observations of data. An ideal model should be able to flexibly deploy both of these abilities. Despite their apparent ability to learn in-context, language models are known to struggle when faced with unseen or rarely seen tokens (Land & Bartolo, 2024). Hence, we study $\textbf{structural in-context learning}$, which we define as the ability of a model to execute in-context learning on arbitrary novel tokens -- so called because the model must generalize on the basis of, e.g., sentence structure or task structure, rather than content encoded in token embeddings. We study structural in-context algorithms on both synthetic and naturalistic tasks using toy models, masked language models, and autoregressive language models. We find that structural ICL emerges early in LM pretraining and then quickly disappears. While it has been shown that ICL can diminish during training (Singh et al., 2023), we find that prior work does not account for structural ICL. Building on the active forgetting method of Chen et al. (2024), we introduce pretraining and finetuning methods that can modulate the preference for structural ICL and IWL. Importantly, this allows us to induce a $\textit{dual process strategy}$ where in-context and in-weights solutions coexist within a single model.
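As a rough illustration of the forgetting schedules mentioned above, the following minimal Python sketch assumes that active forgetting means re-initializing the token embedding table at a fixed interval throughout training, and that temporary forgetting applies the same resets only up to a cutoff step. The function names, the reset interval, the cutoff, the initialization scale, and the placeholder parameter update are all hypothetical choices made for illustration; they are not taken from the paper, and probabilistic temporary forgetting is not sketched here.

```python
import random


def make_embeddings(vocab_size, dim, rng, scale=0.02):
    """Return a freshly initialized embedding table (one vector per token)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(vocab_size)]


def should_reset(step, reset_every, stop_after=None):
    """Decide whether to re-initialize the embeddings at this step.

    stop_after=None -> assumed active forgetting: reset every `reset_every` steps.
    stop_after=N    -> assumed temporary forgetting: same schedule while step <= N.
    """
    if step == 0 or step % reset_every != 0:
        return False
    return stop_after is None or step <= stop_after


def train(num_steps, reset_every, stop_after=None, vocab_size=8, dim=4, seed=0):
    """Toy loop that records the steps at which embeddings are forgotten."""
    rng = random.Random(seed)
    emb = make_embeddings(vocab_size, dim, rng)
    reset_steps = []
    for step in range(1, num_steps + 1):
        # Placeholder for a real optimizer update (hypothetical, not a real model).
        for row in emb:
            for j in range(dim):
                row[j] += 0.001
        if should_reset(step, reset_every, stop_after):
            # Discard learned token identities; other weights would be kept.
            emb = make_embeddings(vocab_size, dim, rng)
            reset_steps.append(step)
    return emb, reset_steps


if __name__ == "__main__":
    _, active = train(num_steps=10, reset_every=3)
    _, temporary = train(num_steps=10, reset_every=3, stop_after=6)
    print("active forgetting resets at:", active)
    print("temporary forgetting resets at:", temporary)
```

Under these assumptions, the only difference between the two schedules is the cutoff: once resets stop, the embeddings are free to accumulate token-specific information again, which is one way to read the controllable balance between structural ICL and IWL described above.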
