Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time

Blake Bordelon, Mary I. Letey, Cengiz Pehlevan

TL;DR

This solvable toy model of neural scaling laws enables computation of exact asymptotics for the risk and derivation of power laws under source/capacity conditions for the ICL tasks.

Abstract

We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on computational and statistical resources: width, depth, number of training steps, batch size, and data per context. In a joint limit where data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) randomly rotated and structured covariances (RRS). For the ISO and FS settings, we find that depth aids ICL performance only if context length is limited. In contrast, in the RRS setting, where covariances change across contexts, increasing depth leads to significant improvements in ICL even at infinite context length. This provides a new solvable toy model of neural scaling laws that depends on both the width and depth of a transformer and predicts an optimal transformer shape as a function of compute. The model enables computation of exact asymptotics for the risk and derivation of power laws under source/capacity conditions for the ICL tasks.
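To make the isotropic (ISO) task concrete, the following is a minimal NumPy sketch of in-context linear regression solved by a fixed number of gradient descent steps on the context. It is an illustration under stated assumptions, not the authors' model or code: it assumes $\alpha = P/D$ for a context of $P$ examples in dimension $D$, draws the task vector with variance $1/D$ per coordinate, treats each of the $L$ layers as one gradient step, and uses the step size $1/\lambda_{\max}$ of the empirical covariance instead of the optimal learning rate mentioned in the Figure 1 caption. All function names are hypothetical.

```python
import numpy as np


def sample_context(D, P, sigma=0.0, rng=None):
    """Draw one isotropic regression context (assumed setup, not from the paper)."""
    rng = np.random.default_rng() if rng is None else rng
    w_star = rng.standard_normal(D) / np.sqrt(D)  # task vector, unit expected norm
    X = rng.standard_normal((P, D))               # isotropic covariates
    y = X @ w_star + sigma * rng.standard_normal(P)
    return X, y, w_star


def in_context_gd(X, y, num_steps, lr):
    """Run `num_steps` steps of GD on the context loss (1/2P) * ||y - X w||^2 from w = 0."""
    P, D = X.shape
    w = np.zeros(D)
    for _ in range(num_steps):
        residual = y - X @ w
        w = w + lr * (X.T @ residual) / P
    return w


def population_risk(w, w_star):
    """Excess risk E_x[(x . (w - w_star))^2] for a test covariate x ~ N(0, I)."""
    return float(np.sum((w - w_star) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = 32
    for alpha in (0.5, 1.0, 2.0):
        P = int(alpha * D)
        X, y, w_star = sample_context(D, P, sigma=0.0, rng=rng)
        # Assumed step size: inverse of the largest eigenvalue of X^T X / P.
        lr = 1.0 / np.linalg.eigvalsh(X.T @ X / P).max()
        risks = [population_risk(in_context_gd(X, y, L, lr), w_star) for L in (1, 2, 4, 8)]
        print(f"alpha={alpha}: " + ", ".join(f"{r:.4f}" for r in risks))
```

In the noiseless case with this step size, the error vector is multiplied at every step by a matrix whose eigenvalues lie in $[0, 1]$, so the risk cannot increase with the number of steps. How much it decreases depends on the context length, which is the qualitative dependence on $\alpha$ and $L$ that the abstract describes.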

Paper Structure

This paper contains 56 sections, 118 equations, 9 figures.

Figures (9)

  • Figure 1: Deep linear self-attention models trained with SGD on the ICL task with isotropic covariates with $D = 32$. (a) Training dynamics for varying $\alpha$. (b) Increasing depth $L$ can improve ICL predictions, especially for $\alpha \approx 1$. (c) The final loss is well predicted by the theory of $L$ steps of gradient descent with optimal learning rate for each $(\alpha,L)$ pair.
  • Figure 2: The loss landscape for the reduced $\Gamma$ model with $\bm\Gamma=\gamma \bm I$ corresponding to the gradient flow limit. This limit is equivalent to optimal step size selection for in-context GD. (a)-(b) The effect of depth $L$ and context length $\alpha$ on the loss. (c) Larger noise $\sigma$ decreases the optimal $\gamma$.
  • Figure 3: Pretraining on FS ICL covariates leads to a solution that does not require depth but is brittle to distribution shift. (a) Evolution of the eigenvalues $\gamma_k(t)$ of the $\bm\Gamma(t)$ matrix for depth $L = 4$ as a function of pretraining time $t$ compared with infinite depth $L \to \infty$ theory (dashed black). (b) For power-law covariates, models of every depth converge as a power law in $t$. There is no asymptotic benefit to increasing depth beyond $L=1$. (c) The ICL solution obtained when training from fixed covariance is brittle to changes in the covariance $\bm\Sigma \to \exp( - \theta \bm S) \bm\Sigma \exp( \theta \bm S)$.
  • Figure 4: Loss dynamics for powerlaw data. (a) Varying the source exponent $\beta$, we see that the scaling with pretraining time has exponent $\frac{\beta}{2+\beta}$. (b) The loss landscape across depths $L$ for the scalar $\gamma$ parameter exhibits minima at $\gamma \approx L$. (c) The training dynamics of the reduced-$\Gamma$ model exhibit $t^{-\beta/(2+\beta)}$ decay before hitting an asymptote which scales as $L^{-\beta}$.
  • Figure 5: Increasing width and depth alone is insufficient to obtain monotonic improvements on powerlaw data with random covariance across contexts. (a) Scaling only width leads to a depth bottleneck (dashed red line). (b) Scaling only depth leads to a width bottleneck (dashed red line). (c) Increasing $N$ and $L$ simultaneously achieves monotonic improvement with compute.
  • ...and 4 more figures
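
The covariance shift in the Figure 3(c) caption, $\bm\Sigma \to \exp(-\theta \bm S) \bm\Sigma \exp(\theta \bm S)$, can be sketched numerically as follows. This is an illustration under assumptions the captions do not state: $\bm S$ is taken to be a real skew-symmetric matrix (so that $\exp(\theta \bm S)$ is a rotation), the power-law spectrum uses a hypothetical decay exponent, and the function names are invented for this example.

```python
import numpy as np


def powerlaw_covariance(D, decay=1.5):
    """Diagonal covariance with eigenvalues k^(-decay); `decay` is a hypothetical choice."""
    eigvals = np.arange(1, D + 1, dtype=float) ** (-decay)
    return np.diag(eigvals)


def random_skew_symmetric(D, rng):
    """Random real skew-symmetric generator S (assumed form of S)."""
    A = rng.standard_normal((D, D)) / np.sqrt(D)
    return (A - A.T) / 2.0


def rotation_from_skew(S, theta):
    """Compute exp(theta * S) from the eigendecomposition of the Hermitian matrix 1j * S."""
    w, V = np.linalg.eigh(1j * S)                    # 1j * S = V diag(w) V^H
    R = (V * np.exp(-1j * theta * w)) @ V.conj().T   # exp(theta * S)
    return R.real                                    # imaginary part is round-off only


def rotate_covariance(Sigma, S, theta):
    """Return exp(-theta S) @ Sigma @ exp(theta S), using exp(-theta S) = exp(theta S)^T."""
    R = rotation_from_skew(S, theta)
    return R.T @ Sigma @ R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Sigma = powerlaw_covariance(16)
    S = random_skew_symmetric(16, rng)
    for theta in (0.0, 0.1, 1.0):
        shifted = rotate_covariance(Sigma, S, theta)
        off_diag = np.linalg.norm(shifted - np.diag(np.diag(shifted)))
        print(f"theta={theta}: off-diagonal norm = {off_diag:.4f}")
```

Under these assumptions the rotation leaves the spectrum of $\bm\Sigma$ unchanged and only moves its eigenvectors, so $\theta$ controls how far the test covariance is rotated away from the one used in pretraining.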