Competition Dynamics Shape Algorithmic Phases of In-Context Learning

Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, Hidenori Tanaka

TL;DR

This work argues that in-context learning (ICL) can be understood as a competition among multiple algorithmic strategies rather than a single capability. By training transformers on a synthetic task that simulates a finite mixture of Markov chains, the authors reproduce core ICL phenomena and identify four distinct algorithms, two retrieval-based (Uni-Ret, Bi-Ret) and two inference-based (Uni-Inf, Bi-Inf), that compete to drive next-token predictions. They introduce Linear Interpolation of Algorithms (LIA), a framework that decomposes model outputs into a convex combination of these algorithms. The resulting phase diagrams shift with data diversity, training steps, and architecture, and they explain both non-monotonic OOD performance and the transience of ICL. The findings offer a unified, mechanistic lens on ICL, with implications for how to design data, models, and training protocols that promote robust, generalizable in-context reasoning rather than memorization.
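To make the idea of a convex combination concrete, here is a minimal sketch of how interpolation weights could be fit for a single next-token prediction. It is an illustration under stated assumptions, not the authors' procedure: the objective (KL divergence between the model's distribution and the mixture), the EM-style multiplicative update, and the names `kl`, `mix`, and `fit_weights` are all hypothetical, and only two stand-in algorithms are used for brevity.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mix(weights, algo_probs):
    """Convex combination of the algorithms' next-token distributions."""
    vocab = len(algo_probs[0])
    return [sum(w * q[j] for w, q in zip(weights, algo_probs)) for j in range(vocab)]

def fit_weights(model_probs, algo_probs, iters=200, eps=1e-12):
    """Assumed fitting rule: EM updates for mixture proportions with fixed components.

    Each update keeps the weights non-negative and summing to one, and it
    never increases KL(model_probs || mixture).
    """
    n = len(algo_probs)
    weights = [1.0 / n] * n  # start from the uniform mixture
    for _ in range(iters):
        m = mix(weights, algo_probs)
        weights = [
            w * sum(p * q[j] / max(m[j], eps) for j, p in enumerate(model_probs) if p > 0)
            for w, q in zip(weights, algo_probs)
        ]
    return weights

# Toy check: the "model" is exactly 70% algorithm A and 30% algorithm B.
algo_probs = [[0.8, 0.1, 0.1],   # stand-in for one algorithm's prediction
              [0.1, 0.1, 0.8]]   # stand-in for another algorithm's prediction
model_probs = mix([0.7, 0.3], algo_probs)
weights = fit_weights(model_probs, algo_probs)
residual = kl(model_probs, mix(weights, algo_probs))
```

In this toy case the fitted weights recover the planted 0.7 / 0.3 split. With four algorithms the same update applies unchanged, although the weights are only identifiable when the algorithms' predictions differ.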

Abstract

In-Context Learning (ICL) has significantly expanded the general-purpose nature of large language models, allowing them to adapt to novel tasks using merely the inputted context. This has motivated a series of papers that analyze tractable synthetic domains and postulate precise mechanisms that may underlie ICL. However, the use of relatively distinct setups that often lack a sequence modeling nature makes it unclear how general the reported insights from such studies are. Motivated by this, we propose a synthetic sequence modeling task that involves learning to simulate a finite mixture of Markov chains. As we show, models trained on this task reproduce most well-known results on ICL, hence offering a unified setting for studying the concept. Building on this setup, we demonstrate we can explain a model's behavior by decomposing it into four broad algorithms that combine a fuzzy retrieval vs. inference approach with either unigram or bigram statistics of the context. These algorithms engage in competition dynamics to dominate model behavior, with the precise experimental conditions dictating which algorithm ends up superseding others: e.g., we find merely varying context size or amount of training yields (at times sharp) transitions between which algorithm dictates the model behavior, revealing a mechanism that explains the transient nature of ICL. In this sense, we argue ICL is best thought of as a mixture of different algorithms, each with its own peculiarities, instead of a monolithic capability. This also implies that making general claims about ICL that hold universally across all settings may be infeasible.
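The data-generating process named in the abstract, a finite mixture of Markov chains, can be sketched as follows. This is a rough illustration rather than the paper's code: the flat Dirichlet prior over rows, the uniform initial state, the sizes, and the helper names (`random_transition_matrix`, `sample_sequence`, `sample_batch`) are assumptions made for the example.

```python
import random

def random_transition_matrix(k, rng):
    """Row-stochastic k x k matrix; each row is a flat Dirichlet draw (assumed prior)."""
    rows = []
    for _ in range(k):
        # Normalized Exp(1) draws are a sample from Dirichlet(1, ..., 1).
        draws = [rng.expovariate(1.0) for _ in range(k)]
        total = sum(draws)
        rows.append([d / total for d in draws])
    return rows

def sample_sequence(T, length, rng):
    """Sample a state sequence from the Markov chain with transition matrix T."""
    k = len(T)
    state = rng.randrange(k)  # uniform initial state (assumption)
    seq = [state]
    for _ in range(length - 1):
        state = rng.choices(range(k), weights=T[state])[0]
        seq.append(state)
    return seq

def sample_batch(chains, batch_size, length, rng):
    """Each sequence comes from one chain picked uniformly from the fixed set."""
    return [sample_sequence(rng.choice(chains), length, rng) for _ in range(batch_size)]

rng = random.Random(0)
NUM_STATES, NUM_CHAINS = 4, 8  # illustrative sizes, not the paper's settings
train_chains = [random_transition_matrix(NUM_STATES, rng) for _ in range(NUM_CHAINS)]
batch = sample_batch(train_chains, batch_size=3, length=16, rng=rng)
```

Here `NUM_CHAINS` plays the role of data diversity: the set of chains is fixed once, and every training batch is drawn from that same finite set.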

Paper Structure

This paper contains 70 sections, 12 equations, 42 figures, 1 table.

Figures (42)

  • Figure 1: Algorithmic phase diagram for a finite Markov mixtures task. We propose to study ICL phenomena through a minimal experimental system: Transformers trained on sequence data generated by a finite mixture of Markov chains. This setup turns out to be extremely rich, capturing most (if not all) known phenomenology of ICL, while remaining simple enough to be amenable to theoretical modeling. We identify four distinct, interpretable algorithmic solutions and characterize the transitions between them as functions of data diversity, optimization steps, and context size; we label these solutions algorithmic phases. Considering the corresponding phase diagram (middle panel), we find that part of the rich phenomenology of ICL emerges from competing algorithmic strategies promoted or suppressed by broader experimental configurations. Specifically, our framework captures an array of known phenomena: a) Data diversity threshold [raventós2023pretrainingtaskdiversityemergence]; b) Emergence of induction heads [edelman2024evolutionstatisticalinductionheads]; c) Transient nature [singh2023transientnatureemergentincontext]; d) Task retrieval and task learning phases [min2022rethinking]; e) Early ascent of risk [xie2021explanation]; and f) Bounded efficacy [lin2024dualoperatingmodesincontext]. See App. \ref{app:icl_phenomena} for a concise summary of these findings and propositions.
  • Figure 2: Data generation and evaluation protocol with finite Markov mixtures. (a) Data generation. We first sample a finite set $\mathcal{T}_{\text{train}} = \{T_1, T_2, \ldots, T_N\}$ of random transition matrices to define our set of Markov chains. We then randomly select a chain from this set and sample a training sequence from it. We repeat this process at every step of training, sampling a fresh batch of sequences by randomly selecting chains from our predefined set. (b) Model training. We train a Transformer [karpathy2022nanogpt] on this sequence data with a standard autoregressive training loss. (c) Evaluation. A novel sequence of states is sampled from the test transition matrix, $T^*$, for evaluation. Here, $T^*$ is either (i) selected from the finite set $\mathcal{T}_{\text{train}}$ (for in-distribution tests), or (ii) newly sampled (for OOD tests). We subsequently compute the KL divergence between the model's empirical transition matrix $\hat{T}$ and the ground-truth transition matrix $T^*$. See App. \ref{app:data_details} for details.
  • Figure 3: Finite Markov mixture setup captures rich phenomenology of in-context learning (ICL). (a) KL divergence (OOD evaluation) as a function of training steps and data diversity (Number of Training Chains). (b) As the diversity of the training data is increased (see ruby vertical dashed line in panel (a)), we reproduce the data diversity threshold for "task learning" ICL, similar to [raventós2023pretrainingtaskdiversityemergence, kirsch2022general]. (c) In the high task-diversity regime with $N=2^7$ (see green horizontal dashed line in panel (a)), we reproduce non-monotonic performance dynamics in a sequence modeling setup. This phenomenon was previously reported as the "transient nature of ICL" in [singh2023transientnatureemergentincontext]. See App. \ref{app:phenomena} for more plots from these experiments.
  • Figure 4: Proposed algorithms for the finite Markov mixture task. (a) Unigram-based Retrieval (Uni-Ret): Given a sequence, Uni-Ret computes a histogram of token frequencies in the sequence and then creates a new transition matrix that is a weighted average of the chains in $\mathcal{T}_{\text{train}}$. The weight associated with a chain is based on the distance between the computed histogram and the chain's steady-state distribution. (b) Bigram-based Retrieval (Bi-Ret): Similar to Uni-Ret, but uses observed transitions, i.e., bigrams, to weight the chains. The resulting likelihood is much sharper, making this algorithm better for the training data. (c) Unigram Inference (Uni-Inf): This algorithm infers a histogram from the given context and draws subsequent tokens from this histogram directly. (d) Bigram Inference (Bi-Inf): This algorithm infers the transition matrix from the given context and draws subsequent tokens from this transition matrix directly. This approach achieves the best OOD generalization among the considered algorithms. The $+$ and $-$ indicate the performance expected on ID chains and OOD chains, where a $+$ indicates better performance.
  • Figure 5: Algorithmic phases. (a) Bigram Utilization: We shuffle the order of all states in a sequence and measure the KL (App. Eq. \ref{eq:bigrams}) before and after the perturbation to quantify the bigram utilization of a model. The shuffling should only affect algorithms sensitive to higher-order statistics. (b) Proximity to Retrieval: A model is labeled "closer" to a retrieval approach when its next-token probabilities are closer to matrices seen in the training set. We evaluate this by sampling an unseen set of transition matrices, and measuring if the model's next-token probabilities have a lower KL w.r.t. transition matrices seen in training or if it is similar to the freshly sampled set (App. Eq. \ref{eq:proximity}). (c) Algorithmic Phases: The product of bigram utilization and proximity to retrieval scores delineates four distinct algorithmic phases. (d) Validating phases: KL between model's and predefined algorithms' next-token probabilities provides validation to our identified phase diagram.
  • ...and 37 more figures
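The four algorithms described in the Figure 4 caption can be sketched as next-token predictors. This is one plausible reading of the caption, not the authors' implementation. The retrieval weights are assumed to be normalized likelihoods (under each chain's steady-state distribution for Uni-Ret, under its transitions for Bi-Ret), and the inference variants are assumed to use add-one smoothing. All function names are hypothetical.

```python
import math

def stationary(T, iters=200):
    """Steady-state distribution by power iteration from the uniform vector."""
    k = len(T)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(k)) for j in range(k)]
    return pi

def normalize_log_weights(logw):
    """Turn log-likelihoods into weights that sum to one."""
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return [x / total for x in w]

def mix_rows(chains, weights, last):
    """Weighted average of each chain's transition row for the last state."""
    k = len(chains[0])
    return [sum(w * T[last][j] for w, T in zip(weights, chains)) for j in range(k)]

def safe_log(p):
    return math.log(max(p, 1e-12))

def uni_ret(context, chains):
    """Weight chains by how well their steady state explains the token counts."""
    logw = []
    for T in chains:
        pi = stationary(T)
        logw.append(sum(safe_log(pi[s]) for s in context))
    return mix_rows(chains, normalize_log_weights(logw), context[-1])

def bi_ret(context, chains):
    """Weight chains by the likelihood of the observed transitions (bigrams)."""
    pairs = list(zip(context, context[1:]))
    logw = [sum(safe_log(T[a][b]) for a, b in pairs) for T in chains]
    return mix_rows(chains, normalize_log_weights(logw), context[-1])

def uni_inf(context, k, alpha=1.0):
    """Predict from the smoothed histogram of the context (alpha is assumed)."""
    counts = [alpha] * k
    for s in context:
        counts[s] += 1
    total = sum(counts)
    return [c / total for c in counts]

def bi_inf(context, k, alpha=1.0):
    """Predict from smoothed counts of transitions out of the last state."""
    last = context[-1]
    counts = [alpha] * k
    for a, b in zip(context, context[1:]):
        if a == last:
            counts[b] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Two toy chains with the same steady state but opposite transition structure.
sticky = [[0.9, 0.1], [0.1, 0.9]]
alternating = [[0.1, 0.9], [0.9, 0.1]]
chains = [sticky, alternating]
context = [0, 1, 0, 1, 0, 1]

p_uni_ret = uni_ret(context, chains)
p_bi_ret = bi_ret(context, chains)
p_uni_inf = uni_inf(context, 2)
p_bi_inf = bi_inf(context, 2)
```

In this toy example both unigram algorithms predict a near-uniform next token, since the two chains share a steady state and the context contains each state equally often. Both bigram algorithms favor a return to state 0: Bi-Ret does so by identifying the alternating chain, and Bi-Inf does so from the observed transitions alone.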