In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization
Ruiqi Zhang, Jingfeng Wu, Peter L. Bartlett
TL;DR
The paper analyzes in-context learning (ICL) for linear regression with a Gaussian prior that has a non-zero mean. It shows that a Linear Transformer Block (LTB) with an MLP component achieves nearly Bayes-optimal ICL risk, whereas linear self-attention (LSA) alone incurs a non-negligible approximation gap. It establishes a precise correspondence between LTB and one-step gradient descent with learnable initialization (GD-β): every GD-β estimator can be implemented by an LTB, and every optimal LTB estimator effectively reduces to a GD-β estimator. The authors derive the globally optimal GD-β and LTB solutions, show that they can match Bayes-optimal performance under reasonable signal-to-noise constraints, and prove that gradient flow on GD-β converges despite the non-convex objective. Empirical GPT-2 experiments illustrate the essential role of the MLP in reducing approximation error when a shared signal is present. Overall, the work clarifies why including the MLP in Transformers improves ICL on structured tasks and connects practical architectures to interpretable optimization procedures.
Abstract
We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\boldsymbol{\beta}$), in the sense that every $\mathsf{GD}\text{-}\boldsymbol{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\boldsymbol{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\boldsymbol{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\boldsymbol{\beta}$, and they highlight the role of MLP layers in reducing approximation error.
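To make the idea of one-step gradient descent with a learnable initialization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the estimator takes one gradient step on the in-context least-squares loss, starting from an initialization `beta` and using a matrix step size `Gamma`, then predicts with the resulting weight vector. The function names, the identity prior covariance, and the specific values of `d`, `n`, and `noise_std` are all hypothetical choices made for illustration. Zero initialization is used only as an illustrative stand-in for a baseline that cannot encode the prior mean.

```python
import numpy as np


def gd_beta_predict(X, y, x_query, beta, Gamma):
    """Predict the query label after one GD step from initialization `beta`.

    Assumed form (illustrative): w = beta - Gamma @ grad, where grad is the
    gradient of (1 / 2n) * ||X @ beta - y||^2 evaluated at `beta`.
    """
    n = X.shape[0]
    grad = X.T @ (X @ beta - y) / n
    w = beta - Gamma @ grad
    return float(w @ x_query)


def simulate_risk(beta, Gamma, prior_mean, n=20, noise_std=0.1,
                  trials=2000, seed=0):
    """Monte Carlo estimate of the squared prediction error on the query.

    Hypothetical data model: task vector ~ N(prior_mean, I), standard
    Gaussian covariates, and additive Gaussian label noise.
    """
    rng = np.random.default_rng(seed)
    d = prior_mean.shape[0]
    errors = []
    for _ in range(trials):
        w_star = prior_mean + rng.standard_normal(d)
        X = rng.standard_normal((n, d))
        y = X @ w_star + noise_std * rng.standard_normal(n)
        x_query = rng.standard_normal(d)
        y_query = x_query @ w_star + noise_std * rng.standard_normal()
        y_hat = gd_beta_predict(X, y, x_query, beta, Gamma)
        errors.append((y_hat - y_query) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    d = 5
    prior_mean = 2.0 * np.ones(d)   # non-zero prior mean (the shared signal)
    Gamma = 0.5 * np.eye(d)         # illustrative step size, not tuned

    risk_zero_init = simulate_risk(np.zeros(d), Gamma, prior_mean)
    risk_mean_init = simulate_risk(prior_mean, Gamma, prior_mean)
    print("risk with zero initialization:      ", risk_zero_init)
    print("risk with prior-mean initialization:", risk_mean_init)
```

In this toy setting, starting the gradient step at the prior mean removes the error contribution that comes from the mean itself, which is the intuition behind letting the initialization be learned. The step size here is fixed by hand; the paper's optimal $\boldsymbol{\beta}$ and $\Gamma$ are derived analytically and are not reproduced in this sketch.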
