Is In-Context Learning Learning?

Adrian de Wynter

TL;DR

This work investigates whether in-context learning (ICL) in autoregressive LLMs truly learns from exemplars or merely exploits prompt-based cues. By conducting a large-scale, controlled empirical study across four LLMs and nine tasks, with extensive prompting ablations and distributional-shift analyses, the authors show that ICL behaves as a learning mechanism in the limit, yet exhibits notable brittleness to out-of-distribution inputs and sensitivity to prompting style. They demonstrate that peak performance emerges with substantial exemplar counts (50–100) and that, as the number of exemplars grows, language, exemplar ordering, and label distributions matter less, while data features dominate; however, cross-task generalisation remains limited due to overreliance on observed distributions. The findings argue for cautious interpretation of LLM capabilities, highlight the need for robust, distribution-aware evaluation, and suggest future work on prompts and reasoning architectures that more closely emulate robust memory-and-reasoning structures. Overall, ICL is a bona fide learning paradigm but with constrained generalisability and explicit dependence on prompt design and data distribution.

Abstract

In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL, ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, ICL deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism, which suggests limited all-purpose generalisability.

Paper Structure

This paper contains 46 sections, 7 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Data generator for PARITY. Each state has transition probabilities $\delta$, and an emission probability. There is a symmetric automaton with emissions at 0.
  • Figure 2: Average accuracy results for (rows, top to bottom) all-task, PARITY, and Reversal in (left to right) modus ponens, description, and CoT; plotted over shots (thick vertical lines) and per-shot $\delta$ between them. PARITY showed good performance even though it is considered difficult for LLMs (Hahn and Rofin, 2024). Reversal had low average accuracy and was brittle to OOD, with sharp decreases per-shot w.r.t. $\delta$, even as shots increased.
  • Figure 3: Averaged over all tasks and models, all prompts have a positive slope (5.2$\pm$1.6) over shots, and a narrowing gap in their $\sigma$ (-2.6$\pm$0.5).
  • Figure 4: Complete set of performances per problem, including averages at the top. Observe how the averages do not necessarily correspond to the performance per-model per-prompt per-task. Consistent behaviours are that CoT is not robust to OOD, and that tasks on average present the same approximate behaviour regardless of prompt.
  • Figure 5: Average over all LLMs and tasks for baseline (left) and word-salad (right) prompts. Description-based prompts rarely performed poorly at zero-shot for all LLMs but Mixtral, while word-salad versions required five shots (Mixtral), ten (GPT-4o), or more. They eventually reached equivalence with their baselines (Table \ref{tab:wscomparison}). In high-accuracy tasks (Hamiltonian, Maze (Complete) and PARITY) the prompts matched DE and modus ponens at between 10 and 100 exemplars. CoT versus SoT had different behaviours: CoT had an (average) modestly increasing trend not reproduced in SoT. This was an average: tasks such as Reversal had the same brittleness to OOD as their CoT counterparts; and tasks such as PARITY even showed non-zero performance.
  • ...and 2 more figures
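
The Figure 1 caption describes an automaton-based data generator for PARITY, and Figure 2 plots accuracy against shots and a shift $\delta$. A minimal sketch of how such a setup might look is given below. It is an illustration under stated assumptions, not the paper's generator: the two-state layout, the reading of $\delta$ as the probability of emitting a 1, the `bits: label` prompt format, and the names `sample_parity_example` and `build_prompt` are all hypothetical.

```python
import random


def sample_parity_example(length, delta=0.5, rng=None):
    """Sample a bit string and its parity label from a two-state automaton.

    Assumption (not from the paper): state 0 means "even number of 1s so
    far", state 1 means "odd". At each step a 1 is emitted with probability
    `delta`, which flips the state; otherwise a 0 is emitted and the state
    is unchanged.
    """
    rng = rng or random.Random()
    state = 0
    bits = []
    for _ in range(length):
        bit = 1 if rng.random() < delta else 0
        bits.append(bit)
        state ^= bit  # flip the state only when a 1 is emitted
    return "".join(str(b) for b in bits), state


def build_prompt(num_shots, length=8, shot_delta=0.5, query_delta=0.5, seed=0):
    """Build a few-shot prompt of labelled exemplars plus one unlabelled query.

    Choosing `query_delta` different from `shot_delta` is one hypothetical
    way to mimic a distributional shift between exemplars and the query.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(num_shots):
        bits, label = sample_parity_example(length, shot_delta, rng)
        lines.append(f"{bits}: {label}")
    query, answer = sample_parity_example(length, query_delta, rng)
    lines.append(f"{query}:")
    return "\n".join(lines), answer


if __name__ == "__main__":
    prompt, answer = build_prompt(num_shots=5, shot_delta=0.5, query_delta=0.9)
    print(prompt)
    print("expected label:", answer)
```

Increasing `num_shots` while holding `shot_delta` fixed, and then varying `query_delta`, gives a toy version of the shots-versus-shift grid that the captions describe.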