In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Ruiqi Zhang; Jingfeng Wu; Peter L. Bartlett

TL;DR

The paper analyzes in-context learning (ICL) for linear regression with a Gaussian prior that has a non-zero mean. It shows that a Linear Transformer Block (LTB) with an MLP component achieves nearly Bayes-optimal ICL risk, while linear self-attention (LSA) alone incurs a non-negligible approximation gap. It establishes a precise correspondence between LTB and one-step gradient descent with learnable initialization (GD-β): every GD-β estimator can be implemented by an LTB, and every optimal LTB estimator effectively reduces to a GD-β estimator. The authors derive the globally optimal GD-β and LTB solutions, show that they can match Bayes-optimal performance under reasonable signal-to-noise constraints, and prove that gradient flow converges for GD-β despite a non-convex objective. They also report GPT-2 experiments illustrating the essential role of the MLP in reducing approximation error when a shared signal is present. Overall, the work clarifies why including the MLP in Transformers improves ICL on structured tasks, and it connects practical architectures to interpretable optimization procedures.

Abstract

We study the in-context learning (ICL) ability of a Linear Transformer Block (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a non-zero mean, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\boldsymbol{\beta}$), in the sense that every $\mathsf{GD}\text{-}\boldsymbol{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\boldsymbol{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\boldsymbol{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\boldsymbol{\beta}$, and they highlight the role of MLP layers in reducing approximation error.
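To make the idea of one-step gradient descent with a learnable initialization concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the parameterization (one preconditioned gradient step on the average squared loss, starting from `beta`, with a matrix step size `Gamma`), the function names, and all numeric settings (`d`, `M`, `prior_std`, `noise_std`) are hypothetical choices made for this example.

```python
import numpy as np


def gd_beta_predict(X, y, x_query, beta, Gamma):
    """Predict the query label after one gradient step from initialization beta.

    Assumed form (illustrative): w_hat = beta - Gamma @ (X^T (X beta - y) / M).
    """
    M = X.shape[0]
    grad = X.T @ (X @ beta - y) / M      # gradient of (1/2M) * ||X beta - y||^2
    w_hat = beta - Gamma @ grad          # single preconditioned GD step
    return float(w_hat @ x_query)


def sample_task(rng, d, M, prior_mean, prior_std, noise_std):
    """Draw one regression task whose weight vector has a non-zero-mean Gaussian prior."""
    w = prior_mean + prior_std * rng.standard_normal(d)
    X = rng.standard_normal((M, d))
    y = X @ w + noise_std * rng.standard_normal(M)
    x_query = rng.standard_normal(d)
    y_query = float(w @ x_query + noise_std * rng.standard_normal())
    return X, y, x_query, y_query


def estimate_icl_risk(beta, Gamma, prior_mean, n_tasks=500, d=5, M=20,
                      prior_std=0.5, noise_std=0.1, seed=0):
    """Monte Carlo estimate of the mean squared prediction error on the query."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_tasks):
        X, y, xq, yq = sample_task(rng, d, M, prior_mean, prior_std, noise_std)
        errors.append((gd_beta_predict(X, y, xq, beta, Gamma) - yq) ** 2)
    return float(np.mean(errors))


def compare_initializations(d=5):
    """Compare a zero initialization with one placed at the prior mean."""
    prior_mean = 2.0 * np.ones(d)        # hypothetical shared signal
    Gamma = np.eye(d)                    # hypothetical (untuned) step matrix
    risk_zero = estimate_icl_risk(np.zeros(d), Gamma, prior_mean, d=d)
    risk_mean = estimate_icl_risk(prior_mean, Gamma, prior_mean, d=d)
    return risk_zero, risk_mean
```

Calling `compare_initializations()` returns the estimated risks for the two initializations on the same sampled tasks. In this toy setting, starting the step at the prior mean removes the error contributed by the shared signal, which is the intuition behind letting the initialization be learned; the optimal `beta` and `Gamma` derived in the paper are not reproduced here.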

Paper Structure (58 sections, 19 theorems, 221 equations, 1 table)

This paper contains 58 sections, 19 theorems, 221 equations, 1 table.

Introduction
Our contributions.
Paper organization.
Notation.
Related Works
Empirical results for ICL in controlled settings.
Transformer implements gradient descent.
Preliminaries
Model input.
A Transformer block.
A linear Transformer block.
A linear self-attention.
Linear regression tasks with a shared signal.
ICL risk.
Benefits of the MLP Component
...and 43 more sections

Key Result

Theorem 4.1

Consider the ICL risk as defined in the paper and the two hypothesis classes $\mathcal{F}_{\mathsf{LTB}}$ and $\mathcal{F}_\mathsf{LSA}$. Suppose that the paper's data assumption holds. Then we have

Theorems & Definitions (34)

Theorem 4.1: Approximation gap
Lemma 5.1
Theorem 5.2: Optimal ${\mathsf{GD}\text{-}\boldsymbol{\beta}}$ models
Theorem 5.3: Optimal LTB models
Lemma 6.1: Bayes optimal ICL
Corollary 6.2
Theorem 6.3
Definition 1.1: Variable Transformation
proof
proof
...and 24 more
