Is In-Context Learning Learning?
Adrian de Wynter
TL;DR
This work investigates whether in-context learning (ICL) in autoregressive LLMs truly learns from exemplars or merely exploits prompt-based cues. Through a large-scale, controlled empirical study across four LLMs and nine tasks, with extensive prompting ablations and distributional-shift analyses, the author shows that ICL behaves as a learning mechanism in the limit, yet is brittle to out-of-distribution inputs and sensitive to prompting style. Peak performance emerges with substantial exemplar counts (50–100). As the number of exemplars grows, language, exemplar ordering, and label distributions matter less, while data features dominate; however, cross-task generalisation remains limited due to overreliance on observed distributions. The findings argue for cautious interpretation of LLM capabilities, highlight the need for robust, distribution-aware evaluation, and suggest future work on prompts and reasoning architectures that more closely emulate robust memory-and-reasoning structures. Overall, ICL is a bona fide learning paradigm, but one with constrained generalisability and an explicit dependence on prompt design and data distribution.
Abstract
In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL, ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, ICL deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism, which suggests limited all-purpose generalisability.
