Strategy Coopetition Explains the Emergence and Transience of In-Context Learning

Aaditya K. Singh, Ted Moskovitz, Sara Dragutinovic, Felix Hill, Stephanie C. Y. Chan, Andrew M. Saxe

TL;DR

The paper investigates why in-context learning (ICL) can emerge and then fade during transformer training. It identifies the asymptotic strategy, context-constrained in-weights learning (CIWL), which combines in-weights information with a contextual label cue, and shows that ICL and CIWL share subcircuits, giving rise to strategy coopetition. A minimal mathematical model and a two-layer transformer study illustrate how ICL can appear transiently thanks to shared Layer 2 subcircuits and later yield to CIWL as the dominant strategy; data properties and exemplar matching can tilt the balance toward persistent ICL. The findings advance our mechanistic understanding of dynamic strategy selection during learning and suggest data-driven interventions to shape the emergence and persistence of ICL and related capabilities.

Abstract

In-context learning (ICL) is a powerful ability that emerges in transformer models, enabling them to learn from context without weight updates. Recent work has established emergent ICL as a transient phenomenon that can sometimes disappear after long training times. In this work, we sought a mechanistic understanding of these transient dynamics. Firstly, we find that, after the disappearance of ICL, the asymptotic strategy is a remarkable hybrid between in-weights and in-context learning, which we term "context-constrained in-weights learning" (CIWL). CIWL is in competition with ICL, and eventually replaces it as the dominant strategy of the model (thus leading to ICL transience). However, we also find that the two competing strategies actually share sub-circuits, which gives rise to cooperative dynamics as well. For example, in our setup, ICL is unable to emerge quickly on its own, and can only be enabled through the simultaneous slow development of asymptotic CIWL. CIWL thus both cooperates and competes with ICL, a phenomenon we term "strategy coopetition." We propose a minimal mathematical model that reproduces these key dynamics and interactions. Informed by this model, we were able to identify a setup where ICL is truly emergent and persistent.

Paper Structure

This paper contains 31 sections, 1 equation, 18 figures.

Figures (18)

  • Figure 1: (a) Example sequences seen during training and evaluation. Training data is "bursty", enabling both in-context and in-weights strategies (the context always contains an exemplar from the same class as the query, but also exemplar-label mappings are fixed throughout training). Evaluation sequences (below dotted line) are designed to measure the presence of different strategies. ICL relies on the exemplar-label mapping in context. IWL depends solely on in-weights information. CIWL requires the correct label in context, but not the query exemplar. The Flip evaluator measures the balance between ICL and CIWL (1.0 means pure ICL, 0.0 means pure CIWL). Bolding indicates OOD exemplar-label pairings. Grayed outputs indicate random selection between the two in-context labels. (b) Accuracy on sequences from (a), over the course of training. "In-context accuracy" is computed by restricting the network's outputs to the two labels present in context; this ensures the same chance level (0.5) for all plotted evaluators. ICL transience is clearly visible in blue. IWL is not shown, as we found little-to-no IWL in the networks (Appendix \ref{appx:results:iwl}). We annotate four points: 1. the formation of Layer 2 circuits, the canonical "induction head"; 2. ICL strategy dominates network output, as evidenced by peak in the Flip evaluator (red); 3. CIWL strategy matches strength of ICL, as indicated by 50% performance on Flip evaluator; 4. CIWL strategy dominates network output, leading Flip evaluator (red) to be 0 and CIWL evaluator (green) to be 1. (c) Illustration of competitive (Layer 1) and cooperative (Layer 2) interactions we find between ICL and CIWL strategies. Both strategies are present in varying amounts through training, as represented by the varying line weights in the Layer 1 circuits: when Layer 1 acts as previous token heads, the network exhibits ICL, but when Layer 1 heads attend to self, the network exhibits CIWL. Crucially, the computation in Layer 2 remains largely unchanged after its initial formation, despite the strategy switch from ICL to CIWL.
  • Figure 2: CIWL strategy is implemented via skip-trigram-like mechanisms in Layer 2 (L2), with substantial K- and V-composition to Layer 1 (L1). (a) Average attention patterns for L2 heads at the end of training. Attention is measured from the query token (index 5) to each token in context. It is computed over CIWL sequences where the correct label is at index 2 (results for label at index 4 in Fig \ref{fig:ciwl2}). We see that, at the end of training, L2 heads attend to the correct label, regardless of what exemplar it is paired with in context. (b) Task performance as a function of clamped attention delta to correct vs. incorrect label token, calculated over 5000 CIWL sequences, when only the given head is active. CIWL accuracy increases as attention to the correct label increases.
  • Figure 3: The transience of emergent ICL is driven by changes in the function of Layer 1. (a) Average induction strength (attention delta to correct ICL token vs incorrect ICL token) on Flip eval data of each Layer 2 head, through the course of training. We see induction circuits emerge then flip, matching the end-of-training attention patterns shown in Fig \ref{fig:ciwl}. (b) For each available checkpoint, we fix Layer 2 weights to be those from the end of training, and plot performance on each of our evaluators. Using the Layer 2 weights from the end of training (darker curves) reproduces the original behavior (lighter curves; matches Fig \ref{fig:main}b) at all points in training after the initial flat portion. This indicates that Layer 2 is not meaningfully changing during the transition from ICL to CIWL. (c) For each available checkpoint, we fix the Layer 1 weights to those from a specific checkpoint (marked by the dotted orange vertical lines). After the Layer 1 weights are fixed in this way (darker curves), network behavior doesn't change, as evidenced by the flat lines on all the data we considered.
  • Figure 4: ICL emergence is enabled by cooperative interactions with CIWL. (a) In this plot, we train networks on "ICL-only" data, i.e. where ICL is a viable strategy but CIWL is not. Without any interventions (black), we hit a loss plateau that greatly slows learning \cite{singh2024needs}. However, if we replace the Layer 2 and unembedding weights with those from end-of-training on our standard training data (green), the network learns quickly. We thus see that these weights, which were part of a CIWL strategy, are reusable for learning ICL. (b) We further consider using Layer 2 + unembedding weights from different checkpoints of a "CIWL-only" run, and then training on "ICL-only" data. Early and late checkpoints lead to no learning of ICL, but middling checkpoints (from 9.8M to 31.5M, identified via binary search) do enable ICL learning. (c) We continued training different checkpoints from the "CIWL-only" run on our standard training data. Once CIWL has formed (later checkpoints), ICL does not re-emerge even when switching to the bursty data (which otherwise permits ICL).
  • Figure 5: Minimal mathematical model captures key phenomena of strategy racing and coopetition. (a) Toy model dynamics. (b) Loss of real transformer, corresponding to Fig \ref{fig:main}b. Notably, both curves exhibit transience behavior in Mechanism 1 (ICL). Intriguingly, the toy model also captures nonmonotonicity in the emergence of Mechanism 2 (CIWL), highlighted in orange.
  • ...and 13 more figures
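
The "in-context accuracy" described in the Figure 1 caption (restricting the network's outputs to the two labels present in context, so that chance is 0.5 for every evaluator) could be computed roughly as follows. This is a minimal sketch, not the authors' code: the function name `in_context_accuracy` and the input layout (per-sequence logit lists, a pair of in-context label ids, and a target label id) are assumptions made for illustration.

```python
def in_context_accuracy(logits, context_labels, targets):
    """Accuracy after restricting predictions to the labels seen in context.

    logits:         list of per-sequence logit lists, one logit per class
    context_labels: list of (label_a, label_b) class ids present in context
    targets:        list of target class ids, one per sequence
    All three layouts are illustrative assumptions.
    """
    correct = 0
    for row, labels, target in zip(logits, context_labels, targets):
        # Ignore every class except the two in-context labels, then pick
        # whichever of those two has the larger logit.
        pred = max(labels, key=lambda lab: row[lab])
        correct += int(pred == target)
    return correct / len(targets)


# Toy example: the unrestricted argmax would pick classes 1 and 3,
# but only classes 0 and 2 appear in context.
example_logits = [[0.1, 5.0, 0.2, 0.3], [2.0, 0.0, 1.0, 9.0]]
example_context = [(0, 2), (0, 2)]
example_targets = [2, 0]
example_accuracy = in_context_accuracy(example_logits, example_context, example_targets)
```

Because the prediction is always one of two labels, a network with no preference between them scores 0.5 on average, which is what makes the ICL, CIWL, and Flip curves directly comparable.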