In-Context Learning Learns Label Relationships but Is Not Conventional Learning

Jannik Kossen, Yarin Gal, Tom Rainforth

TL;DR

This work investigates how large language models use the input-label relationships shown in their context, and asks whether in-context learning (ICL) behaves like a conventional learning algorithm. The authors formalize three null hypotheses and run probabilistic analyses across multiple models and tasks. They show that ICL predictions depend on in-context labels and that ICL can learn novel label relationships in-context, yet pre-training biases persist and ICL does not treat all in-context information equally. The study introduces a fast, single-pass method to map ICL dynamics across all context sizes, and uses randomized labels, novel tasks, and label-flipping experiments to probe how in-context cues interact with pre-training. The resulting picture is a middle ground: ICL can learn label relationships in-context but does not behave fully like conventional learning, and the position of information within the context affects how it is integrated. In practice, this suggests that prompting alone may be insufficient to override preferences acquired in pre-training, and that robust ICL-based applications need to account for how information is weighted across the context.

Abstract

The predictions of Large Language Models (LLMs) on downstream tasks often improve significantly when including examples of the input-label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works. For example, while Xie et al. (2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022) argue ICL does not even learn label relationships from in-context examples. In this paper, we provide novel insights into how ICL leverages label information, revealing both capabilities and limitations. To ensure we obtain a comprehensive picture of ICL behavior, we study probabilistic aspects of ICL predictions and thoroughly examine the dynamics of ICL as more examples are provided. Our experiments show that ICL predictions almost always depend on in-context labels and that ICL can learn truly novel tasks in-context. However, we also find that ICL struggles to fully overcome prediction preferences acquired from pre-training data and, further, that ICL does not consider all in-context information equally.
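
The probabilistic quantities mentioned in the abstract and in the figure captions (accuracy, log likelihood, and entropy of label predictions) can be illustrated with a minimal sketch. It is not the authors' code. The function names `label_metrics` and `randomize_labels` are hypothetical, the predicted label probabilities are assumed to be already extracted from a model and normalized over the label set, and label randomization is assumed here to mean drawing each in-context label uniformly at random.

```python
import numpy as np


def label_metrics(probs, labels):
    """Return (accuracy, mean log likelihood, mean entropy).

    probs:  (n, k) array, one normalized distribution over k labels per query
            (assumed to be extracted from the model beforehand).
    labels: (n,) integer array of true label indices.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    eps = 1e-12  # guards against log(0)

    # Accuracy only looks at the most likely label.
    accuracy = float(np.mean(np.argmax(probs, axis=1) == labels))
    # Log likelihood looks at the probability assigned to the true label.
    true_label_probs = probs[np.arange(len(labels)), labels]
    log_likelihood = float(np.mean(np.log(true_label_probs + eps)))
    # Entropy measures how uncertain each predictive distribution is.
    entropy = float(np.mean(-np.sum(probs * np.log(probs + eps), axis=1)))
    return accuracy, log_likelihood, entropy


def randomize_labels(labels, num_classes, rng):
    """Replace every label with a uniform draw (an assumed randomization scheme),
    which removes any dependence of the label on its input."""
    return rng.integers(0, num_classes, size=len(labels))


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    labels = [0, 1, 1]
    print(label_metrics(probs, labels))
    print(randomize_labels(labels, num_classes=2, rng=np.random.default_rng(0)))
```

Two sets of predictions can share the same accuracy while differing in log likelihood and entropy, which is one reason a purely accuracy-based comparison can hide differences between default and randomized labels.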

Paper Structure (17 sections, 3 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: ICL predictions generally depend on the conditional label distribution of in-context examples: when in-context labels are randomized, average log likelihoods of label predictions decrease compared to ICL with default labels for LLaMa-2-70B across a variety of tasks. Results are averaged over 500 in-context datasets; thin lines are 99% confidence intervals. See §\ref{sec:random} for details.
  • Figure 2: Few-shot ICL training dynamics in a default label scenario on SST-2. Accuracy ($\uparrow$) and log likelihood ($\uparrow$) improve with in-context dataset size, and entropies decrease appropriately. Averages over 500 random subsets, thick lines with moving average (window size 5) for clarity.
  • Figure 3: Few-shot ICL with randomized labels for SST-2: Compared to default ICL behavior (dashed lines), log likelihoods and entropies of the Falcon models degrade when in-context labels are randomized. Accuracies show the differences less clearly than the probabilistic metrics, log likelihood and entropy. Averages over 500 repetitions, thick lines with moving average (window size 5) for clarity.
  • Figure 4: Few-shot ICL achieves accuracies significantly better than random guessing on our novel author identification task. Thus, LLMs can learn novel label relationships entirely in-context. Averages over 500 runs, thick lines with additional moving average (window size 5) for clarity.
  • Figure 5: Few-shot ICL with replacement labels for Falcon-40B on SST-2, LLaMa-65B on Hate Speech, and LLaMa-2-70B on MQP. Table \ref{tab:flipped_summary_main} and §\ref{sec:extended_results} contain results for all other models and tasks. ICL achieves better-than-guessing performance for all label relations and models. However, predictions for flipped labels (dashed blue) plateau at higher entropies and lower likelihoods than those for the default label relation (solid blue). For arbitrary labels (pink), the model performs similarly for both label directions. Averages over 100 runs and thick lines with moving average (window size 5).
  • ...and 6 more figures
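
The captions above describe curves that are averaged over many runs and then smoothed with a moving average of window size 5. A minimal sketch of that post-processing follows. The function names `average_over_runs` and `moving_average` are hypothetical, and the exact windowing used for the figures (centered, trailing, or how edges are handled) is not stated in the captions, so the "valid" windowing below is an assumption.

```python
import numpy as np


def average_over_runs(curves):
    """Average metric curves over runs.

    curves: (num_runs, num_context_sizes) array, one row per random
            in-context dataset, one column per context size.
    """
    return np.asarray(curves, dtype=float).mean(axis=0)


def moving_average(values, window=5):
    """Smooth a 1-D curve with an unweighted moving average.

    Only full windows are used (mode="valid"), so the output has
    len(values) - window + 1 entries. This edge handling is an assumption.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: 500 runs, 50 context sizes, noisy upward trend.
    runs = np.linspace(0.5, 0.9, 50) + rng.normal(0.0, 0.1, size=(500, 50))
    mean_curve = average_over_runs(runs)
    smooth_curve = moving_average(mean_curve, window=5)
    print(mean_curve.shape, smooth_curve.shape)
```

Averaging over runs reduces the variance caused by the particular choice of in-context examples, while the moving average only aids readability and does not change the underlying per-context-size estimates.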