Why Larger Language Models Do In-context Learning Differently?

Zhenmei Shi, Junyi Wei, Zhuoyan Xu, Yingyu Liang

TL;DR

The paper asks why larger language models behave differently in in-context learning (ICL), especially when the context is noisy. It analyzes two stylized settings: a one-layer, single-head linear self-attention model for linear regression and a two-layer, multi-head transformer for sparse parity classification. In both, it derives closed-form optimal solutions showing that smaller models prioritize a limited set of important feature directions (the top eigen-directions of the token covariance), while larger models attend to more directions, which gives broader feature coverage but makes them more susceptible to noise. The theory quantifies this trade-off between coverage and robustness and is corroborated by NLP experiments with Llama models, which show that larger models can outperform in clean contexts but are more easily disrupted by label and input noise. The findings illuminate how model scale shapes attention in ICL and suggest practical guidance for robust deployment and architecture design of LLMs.

Abstract

Large language models (LLMs) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL): they can perform well on unseen tasks given a brief series of task examples, without any adjustment to the model parameters. One recent and puzzling observation is that models of different scales may have different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically, aiming to improve the understanding of LLMs and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer multi-head attention transformers (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.

Paper Structure (26 sections, 4 theorems, 50 equations, 3 figures)

Key Result

Lemma 4.1

Let $\Gamma := \left(1+\frac{1}{N}\right)\Lambda + \frac{1}{N} \operatorname{tr}(\Lambda) I_{d \times d} \in \mathbb{R}^{d\times d}$. Then we have $\mathcal{L}(f_{\textup{LSA},\theta}) = \tilde{\ell}(\mathbf{U}, u) + C$, where $C$ is a constant independent of $\theta$.
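
To make the definition of $\Gamma$ concrete, here is a minimal NumPy sketch that evaluates it for a small diagonal $\Lambda$. It reads $\Lambda$ as a $d \times d$ covariance matrix and $N$ as a positive integer (for example, the number of in-context examples); both readings are assumptions, since this excerpt does not define them. The helper name `gamma_matrix` and the numeric values are hypothetical.

```python
import numpy as np


def gamma_matrix(Lambda: np.ndarray, N: int) -> np.ndarray:
    """Evaluate Gamma = (1 + 1/N) * Lambda + (1/N) * tr(Lambda) * I."""
    d = Lambda.shape[0]
    return (1.0 + 1.0 / N) * Lambda + (np.trace(Lambda) / N) * np.eye(d)


# Illustrative covariance with a decaying spectrum (made-up values).
Lambda = np.diag([4.0, 2.0, 1.0])

for N in (1, 10, 1000):
    # Print the diagonal of Gamma for each N.
    print(N, np.diag(gamma_matrix(Lambda, N)))
```

Because both correction terms scale as $1/N$, $\Gamma$ approaches $\Lambda$ as $N$ grows; for small $N$ the $\operatorname{tr}(\Lambda)$ term adds the same offset to every direction.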

Figures (3)

  • Figure 1: Larger models are more easily affected by noise (flipped labels) and more readily override pretrained biases than smaller models, across datasets and model families (chat models, with instruction tuning). Accuracy is calculated over 1000 evaluation prompts per dataset and over 5 runs with different random seeds for each evaluation, using $M = 16$ in-context exemplars.
  • Figure 2: Larger models are more easily affected by noise (flipped labels) and more readily override pretrained biases than smaller models, across datasets and model families (original models, without instruction tuning). Accuracy is calculated over 1000 evaluation prompts per dataset and over 5 runs with different random seeds for each evaluation, using $M = 16$ in-context exemplars.
  • Figure 3: The magnitude of attention between the labels and input sentences in Llama 2-13b and 70b on 100 evaluation prompts; see the main text for the details. $x$-axis: indices of the prompts. $y$-axis: the norm of the last row of attention maps in the final layer. Correct: original label; wrong: flipped label; relevant: original input sentence; irrelevant: irrelevant sentence from other datasets. The results show that larger models focus on both sentences, while smaller models only focus on relevant sentences.
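
The $y$-axis quantity in Figure 3 can be sketched as follows. This is only an illustration under stated assumptions: the caption does not say which norm is used or how token positions are grouped, so the sketch uses the L2 norm and restricts the last row to a contiguous span of key positions. The function name `last_row_attention_norm`, the toy attention map, and the span indices are all hypothetical.

```python
import numpy as np


def last_row_attention_norm(attn: np.ndarray, span: slice) -> float:
    """L2 norm of the last query row of `attn`, restricted to key positions in `span`.

    Assumption: `attn` is a (T, T) attention map from a single layer, already
    aggregated over heads in whatever way the analysis calls for.
    """
    last_row = attn[-1]
    return float(np.linalg.norm(last_row[span]))


# Toy causal attention map over 4 tokens; each row sums to 1.
attn = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.6, 0.2, 0.1],
])

sentence_span = slice(0, 2)  # hypothetical positions of an input sentence
label_span = slice(2, 3)     # hypothetical position of its label

print(last_row_attention_norm(attn, sentence_span))
print(last_row_attention_norm(attn, label_span))
```

Comparing this value for relevant versus irrelevant sentence spans, per prompt, gives the kind of curve the caption describes.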

Theorems & Definitions (13)

  • Lemma 4.1: Lemma A.1 in Zhang et al. (2023)
  • Proof: Proof sketch of \ref{thm:low_opt}
  • Proof: Proof sketch of \ref{theorem:opt_parity}
  • Remark 5.1
  • Proof: Proof of \ref{thm:low_opt}
  • Proof: Proof of \ref{thm:mse}
  • Proof: Proof of \ref{prop:diff}
  • Lemma 2.1: Corollary A.2 in Zhang et al. (2023)
  • Lemma 2.2
  • Proof: Proof of \ref{lem:Isserlis}
  • ...and 3 more