The Initialization Determines Whether In-Context Learning Is Gradient Descent

Shifeng Xie, Rui Yuan, Simone Rossi, Thomas Hannagan

TL;DR

This work interrogates the claim that in-context learning in large language models is equivalent to gradient descent, focusing on non-zero prior means in a linear regression setting. It shows that multi-head linear self-attention cannot generally replicate one-step GD when priors have non-zero mean, and identifies the query’s initial guess y_q as the crucial factor for closing the gap. To address this, the paper introduces y_q-LSA, a minimal input-side extension with a trainable initial guess that restores GD-equivalence, supported by theory and synthetic experiments. It also provides proof-of-concept evidence that explicit initial guesses can improve ICL in large language models on semantic similarity tasks. The results offer a principled avenue to enhance ICL via prompting strategies and lay groundwork for extending GD-level analyses to more complex transformer architectures.

Abstract

In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions: zero-mean Gaussian priors and zero initialization for GD. Subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the embedding matrix of multi-head LSA by introducing an initial estimate for the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for the ICL linear regression setup. Our experiments confirm this result and further show that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce y_q-LSA, a simple generalization of single-head LSA with a trainable initial guess y_q. We theoretically establish the capabilities of y_q-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widely used LLMs augmented with initial guess capabilities, and show that their performance improves on a semantic similarity task.
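
To make the role of the initialization concrete, the following is a minimal sketch of one-step GD on an in-context linear regression task, run once from zero and once from the prior mean. The data model (isotropic Gaussian inputs, noiseless labels, unit prior covariance), the step size, and the helper names `one_step_gd_predict` and `icl_risk` are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def one_step_gd_predict(X, y, x_q, w0, eta):
    """Predict y_q after one GD step from w0 on the mean squared context loss."""
    n = X.shape[0]
    grad = X.T @ (X @ w0 - y) / n      # gradient of (1/2n) * ||X w - y||^2 at w0
    w1 = w0 - eta * grad
    return w1 @ x_q                    # equals w0 @ x_q (the initial guess) when eta = 0

def icl_risk(w_star, w0, eta, n=20, n_tasks=2000, seed=0):
    """Monte Carlo estimate of the squared query error over sampled tasks.

    Assumed toy task distribution: w ~ N(w_star, I), x ~ N(0, I), y = w^T x.
    """
    rng = np.random.default_rng(seed)
    d = w_star.shape[0]
    errs = []
    for _ in range(n_tasks):
        w = w_star + rng.standard_normal(d)   # task weights drawn around the prior mean
        X = rng.standard_normal((n, d))       # context inputs
        x_q = rng.standard_normal(d)          # query input
        y = X @ w                             # noiseless context labels
        y_hat = one_step_gd_predict(X, y, x_q, w0, eta)
        errs.append((y_hat - w @ x_q) ** 2)
    return float(np.mean(errs))

d, n = 5, 20
w_star = 3.0 * np.ones(d)            # non-zero prior mean (arbitrary choice)
eta = n / (n + d + 1)                # step size picked for this toy setting

risk_zero = icl_risk(w_star, np.zeros(d), eta, n=n)   # GD started at zero
risk_prior = icl_risk(w_star, w_star, eta, n=n)       # GD started at the prior mean
print(f"zero init: {risk_zero:.3f}   prior-mean init: {risk_prior:.3f}")
```

Both runs see the same sampled tasks because the seed is shared, so the difference in estimated risk comes only from the starting point `w0`. In this toy setting, starting from the prior mean amounts to giving the predictor the initial guess `w_star @ x_q` for the query before any context is used.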

Paper Structure

This paper contains 36 sections, 8 theorems, 153 equations, 6 figures, 1 table.

Key Result

Theorem 1

Let $d \in \mathbb{N}$, and consider the hypothesis classes ${\cal F}_{(d+1)-\mathsf{LSA}}$ and ${\cal F}_{(d+2)-\mathsf{LSA}}$ corresponding to multi-head LSA models with $H = d+1$ and $H = d+2$ attention heads, respectively. Then $$\inf_{f \in {\cal F}_{(d+1)-\mathsf{LSA}}} {\cal R}(f) = \inf_{f \in {\cal F}_{(d+2)-\mathsf{LSA}}} {\cal R}(f),$$ where ${\cal R}(f)$ is the ICL risk.

Figures (6)

  • Figure 1: Training and evaluation loss curves of LSA with a non-zero prior mean. The dashed red line denotes the baseline loss achieved by one-step GD.
  • Figure 2: Training loss of multi-head LSA with different numbers of attention heads. (a) Training loss curves for models with different head configurations; each curve shows the expected ICL risk during parameter training (Adam updates of $\{W^Q,W^K,W^V,W^P\}$; no updates at test time). (b) Final trained loss as a function of the number of heads.
  • Figure 3: Training loss of multi-head LSA under different prior means $\mathbf{w}_\star$. (a) Training loss curves for different values of $\|\mathbf{w}_\star\|$. (b) Final trained loss as a function of $\|\mathbf{w}_\star\|^2$. Multi-head LSA matches the one-step GD loss only when ${\bf w}_\star=0$; for ${\bf w}_\star\neq 0$ the gap grows approximately linearly with $\|{\bf w}_\star\|_2^2$.
  • Figure 4: Training and final loss of multi-head LSA under different initial guess configurations. Left: training loss curves for various $\|y_{\text{q\_bias}}\|^2$. Middle: final trained loss as a function of $\|y_{\text{q\_bias}}\|^2$. Right upper: training loss curves for various $\|\mathbf{y}_{\text{q\_guess}}\|$. Right lower: final trained loss as a function of $\|{\bf y}_{\text{q\_guess}}\|^2$. Multi-head LSA reaches the GD loss only when both the linear guess component and the bias vanish ($y_q={\bf w}_\star^\top {\bf x}_q$ and no offset).
  • Figure 5: Training loss and sensitivity analysis of $y_q$-LSA. (a) Training loss curves of $y_q$-LSA and one-step GD. (b) Model behavior metrics including prediction norm difference, gradient norm difference, and cosine similarity.
  • ...and 1 more figure
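
The effect of the query's initial guess described in the captions above can be illustrated with a single hand-built LSA head. The sketch below assumes a common formulation from the ICL linear regression literature (a prompt matrix with one column per token, merged weight matrices `P` and `Q`, a mask on the query column, and a residual connection); this formulation, the weight choice, and the names `lsa_predict` and `one_step_gd_predict` are assumptions for illustration and are not taken from the paper's construction.

```python
import numpy as np

def lsa_predict(X, y, x_q, y_q, P, Q):
    """One masked linear self-attention layer; returns the updated query label slot."""
    n, d = X.shape
    # Prompt matrix Z: columns are (x_i, y_i) for context tokens and (x_q, y_q) for the query.
    Z = np.zeros((d + 1, n + 1))
    Z[:d, :n] = X.T
    Z[d, :n] = y
    Z[:d, n] = x_q
    Z[d, n] = y_q                      # initial guess placed in the query's label slot
    # Mask out the query column so only context tokens contribute values.
    M = np.eye(n + 1)
    M[n, n] = 0.0
    Z_out = Z + P @ Z @ M @ (Z.T @ Q @ Z) / n   # residual + linear attention update
    return Z_out[d, n]

def one_step_gd_predict(X, y, x_q, w0, eta):
    """Prediction after one GD step from w0 on the mean squared context loss."""
    n = X.shape[0]
    w1 = w0 - eta * X.T @ (X @ w0 - y) / n
    return w1 @ x_q

rng = np.random.default_rng(0)
d, n, eta = 4, 16, 0.5
w0 = rng.standard_normal(d)            # GD initialization (stand-in for a prior mean)
w = w0 + rng.standard_normal(d)        # task weights
X = rng.standard_normal((n, d))
y = X @ w
x_q = rng.standard_normal(d)

# Hand-picked weights: Q compares inputs, P reads the residual y_i - w0^T x_i.
Q = np.zeros((d + 1, d + 1))
Q[:d, :d] = np.eye(d)
P = np.zeros((d + 1, d + 1))
P[d, :d] = -eta * w0
P[d, d] = eta

pred_gd = one_step_gd_predict(X, y, x_q, w0, eta)
pred_lsa_guess = lsa_predict(X, y, x_q, w0 @ x_q, P, Q)   # initial guess y_q = w0^T x_q
pred_lsa_zero = lsa_predict(X, y, x_q, 0.0, P, Q)         # no initial guess
print(pred_gd, pred_lsa_guess, pred_lsa_zero)
```

With these weights the attention update supplies the gradient correction, while the term `w0 @ x_q` can only enter through the query's label slot. Setting `y_q = 0` therefore leaves the LSA output short of the GD prediction by exactly `w0 @ x_q`, which is one way to see why an explicit initial guess matters when the initialization is non-zero.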

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Definition 3: $y_q$-LSA
  • Theorem 4
  • Lemma 2
  • Theorem 4
  • proof
  • Theorem 4
  • proof
  • ...and 4 more