Do pretrained Transformers Learn In-Context by Gradient Descent?

Lingfeng Shen; Aayush Mishra; Daniel Khashabi

TL;DR

This paper challenges the widely cited equivalence between In-Context Learning (ICL) and Gradient Descent (GD) by arguing that prior demonstrations rely on unrealistically restricted task and model spaces and on hand-crafted weight structures. Through theoretical critique and large-scale empirical analysis on models pretrained on natural data (not trained with an ICL objective), it shows that ICL and GD exhibit distinct order sensitivity, convergence behavior, and output distribution patterns across datasets and model sizes. The results reveal a substantial gap between ICL and both GD and implicit-GD variants, suggesting that the equivalence remains an open hypothesis in practical settings. The work calls for more nuanced investigations, aligned with real-world training, into the mechanisms behind ICL in pretrained transformers, and cautions against extrapolating idealized GD-ICL mappings to natural-language modeling scenarios.

Abstract

The emergence of In-Context Learning (ICL) in LLMs remains a remarkable phenomenon that is only partially understood. To explain ICL, recent studies have drawn theoretical connections to Gradient Descent (GD). We ask: do such connections hold up in actual pre-trained language models? We highlight the limiting assumptions in prior works that make their setup considerably different from the practical setup in which language models are trained. For example, their experimental verification uses an \emph{ICL objective} (training models explicitly for ICL), which differs from the emergent ICL in the wild. Furthermore, the hand-constructed weights used in these theoretical studies have properties that do not match those of real LLMs. We also look for evidence in real models. We observe that ICL and GD differ in their sensitivity to the order in which they observe demonstrations. Finally, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMa-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD modify the output distribution of language models differently. These results indicate that \emph{the equivalence between ICL and GD remains an open hypothesis} and call for further studies.

Paper Structure (54 sections, 2 theorems, 8 equations, 24 figures, 1 table)

Introduction
Background
Sampling tasks and models
Sampling from the space of well-defined tasks.
Sampling from the space of pretrained models.
Standard Learning Setups
In-context learning (ICL).
Gradient Descent (GD).
The limiting assumptions in the study of ICL$\approx$GD hypothesis
Real LLMs are not pretrained with ICL objective
Changing the space of tasks.
Changing the space of models.
Hand-constructed weights and their limits
How does the model arrive at the correct $P$?
Are LLM weights this sparse?
...and 39 more sections

Key Result

Theorem 1

Given a pretrained model $M_{\Theta_0} \in \mathcal{M}$, an algorithm $\mathcal{A}$ equivalent to ICL, and demonstrations $S = \{(x_i, f(x_i))\}_{i=1}^N$ of a well-defined task $f \sim \mathcal{F}$, let $\sigma_A, \sigma_B$ denote two orders of elements in $S$, such that $\Theta_{\sigma_A} \leftarrow

Figures (24)

Figure 1: Discussed in \ref{subsec:existing}, \ref{subsec:42}, and \ref{sec:empirical}.
Figure 2: We show that the sparsity ratio in LLaMA (averaged across layers, with the standard deviation shown as shading) is much lower than required by previous works to implement GD. More plots in \ref{sparse_rate}.
Figure 3: GPT-J's ability to do ICL (on AGNews) does not change much over a time cross-section of training while the parameters change steadily.
Figure 4: Order Sensitivity (standard deviation in output probabilities over the vocabulary) of ICL and GD (and its variants SGD and Adam) as measured on LLaMa-7B on AGNews. The standard deviation is taken across $10$ different orders of 8 ICL demos. More results are deferred to \ref{extra_order}.
Figure 5: Comparison of ICL and GD/$\widehat{\text{GD}}$ on our three metrics for the AGNews dataset (with 4 ICL demos). ICL lines in Token Overlap and Overlap Cosine Similarity are calculated between two different ICL output distributions (with different order of demonstrations in the prompt). A substantial gap between ICL and GD is highlighted by the gray diagonal lines.
...and 19 more figures
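
The order-sensitivity measure named in the Figure 4 caption (standard deviation in output probabilities across different demonstration orders) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `order_sensitivity`, the input layout (one row of vocabulary probabilities per ordering), and the choice to average the per-token standard deviation over the vocabulary are all assumptions, and the toy distributions stand in for real model outputs.

```python
import numpy as np


def order_sensitivity(probs_by_order):
    """Mean (over vocabulary entries) of the std across demonstration orders.

    probs_by_order: array of shape (n_orders, vocab_size); each row is the
    output distribution obtained with one ordering of the same demonstrations.
    The aggregation over the vocabulary (a plain mean) is an assumption.
    """
    probs = np.asarray(probs_by_order, dtype=float)
    if probs.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_orders, vocab_size)")
    per_token_std = probs.std(axis=0)  # spread of each token's probability across orders
    return float(per_token_std.mean())


# Toy stand-ins for model outputs over a 3-token vocabulary and 10 orders.
rng = np.random.default_rng(0)
stable = np.tile(np.array([0.7, 0.2, 0.1]), (10, 1))  # identical output for every order
noisy = rng.dirichlet(np.ones(3), size=10)            # output changes with the order

stable_score = order_sensitivity(stable)
noisy_score = order_sensitivity(noisy)
```

A learner whose output does not depend on the order scores (numerically) zero, while one whose output shifts with the order scores strictly higher.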

Theorems & Definitions (5)

Definition 1: Algorithmic equivalence to ICL
Theorem 1: Algorithmic equivalence implies the same order sensitivity
proof
Definition 2
Corollary 1
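
To make the notion of order sensitivity behind Theorem 1 concrete, the sketch below contrasts two learners on a toy linear-regression task: full-batch GD, whose update sums over all demonstrations and is therefore unaffected by their order, and a single sequential pass of SGD, whose result generally depends on the order. The task, the model, the learning rate, and every function name here are illustrative assumptions; this is not the paper's experimental setup, which uses pretrained language models.

```python
import numpy as np


def grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)


def full_batch_gd(X, y, lr=0.1, steps=5):
    """Gradient descent where every step uses all demonstrations at once."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * grad(w, X, y)
    return w


def sequential_sgd(X, y, lr=0.1):
    """One pass of SGD that visits the demonstrations one by one, in order."""
    w = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        w = w - lr * grad(w, xi[None, :], np.array([yi]))
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # 8 toy "demonstrations"
y = X @ np.array([1.0, -2.0, 0.5])     # targets from a hypothetical linear task
X_rev, y_rev = X[::-1], y[::-1]        # same demonstrations, reversed order

gd_gap = np.linalg.norm(full_batch_gd(X, y) - full_batch_gd(X_rev, y_rev))
sgd_gap = np.linalg.norm(sequential_sgd(X, y) - sequential_sgd(X_rev, y_rev))
```

Here `gd_gap` is zero up to floating-point error, while `sgd_gap` is clearly nonzero, so the two procedures have different order sensitivity even on this small example.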
