Pretrain-Test Task Alignment Governs Generalization in In-Context Learning

Mary I. Letey, Jacob A. Zavatone-Veth, Yue M. Lu, Cengiz Pehlevan

TL;DR

This work investigates how the structure of pretraining tasks governs generalization in in-context learning (ICL) by deriving an exact high‑dimensional generalization error for a solvable linear regression model with linear attention. It introduces a task‑alignment measure based on the alignment between pretraining and test covariances and shows that this alignment predicts ICL performance in both linear and nonlinear Transformer architectures. The analysis reveals a tradeoff between specialization and generalization, showing that increasing pretraining task diversity can either improve or harm test performance depending on alignment and sample regime. These results highlight pretrain–test task alignment as a key determinant of ICL generalization and suggest that curated task curricula can enhance the emergent algorithmic capabilities of Transformers.

Abstract

In-context learning (ICL) is a central capability of Transformer models, but the structures in data that enable its emergence and govern its robustness remain poorly understood. In this work, we study how the structure of pretraining tasks governs generalization in ICL. Using a solvable model for ICL of linear regression by linear attention, we derive an exact expression for ICL generalization error in high dimensions under arbitrary pretraining-testing task covariance mismatch. This leads to a new alignment measure that quantifies how much information about the pretraining task distribution is useful for inference at test time. We show that this measure directly predicts ICL performance not only in the solvable model but also in nonlinear Transformers. Our analysis further reveals a tradeoff between specialization and generalization in ICL: depending on task distribution alignment, increasing pretraining task diversity can either improve or harm test performance. Together, these results identify train-test task alignment as a key determinant of generalization in ICL.

Paper Structure

This paper contains 36 sections, 5 theorems, 183 equations, 7 figures, 2 tables.

Key Result

Lemma 1

Simplified test losses. Consider the IDG and ICL test distributions and the corresponding error functions given by (eq:idg_def) and (eq:icl_def). For fixed parameters $\Gamma \in \mathbb{R}^{d \times (d+1)}$ and data sampled according to test distribution $\mathcal{P}_\mathrm{IDG}$ or $\mathcal{P}_\mathrm{ICL}$ … where for $\bm{b}_k$ and $R_k$ defined by the pretraining task sample $\{w_1, \ldots, w_k\}$ as …

Figures (7)

  • Figure 1: Theoretical $e_\mathrm{ICL}$ (left panel) and $e_\mathrm{misalign}$ (right panel) curves plotted against numerical simulations of $\mathcal{E}(\Gamma^*)$ computed directly from sampled data. We choose $C_\mathrm{train}$ with uniform eigenvalue distribution: $C_\mathrm{train} \propto \mathrm{diag}([d, d-1, \ldots, 1])$ such that $\mathop{\mathrm{tr}}\nolimits(C_\mathrm{train}) = d$. We compare $C_\mathrm{test} = C_\mathrm{train}$ (red curves) with testing on single task directions, i.e., the "idx i/d" labels correspond to rank-1 test covariances $C_\mathrm{test}^i = d\bm{e}_i\bm{e}_i^\top$ spiked at index $i$. In this way, $C_\mathrm{test}^1$ captures the strongest task direction of $C_\mathrm{train}$ and $C_\mathrm{test}^d$ captures the weakest. Parameters: $d = 120$, $\alpha = 2$, $\tau = 4$, $\rho = 0.01$. Shading represents $\pm$ std of numerical simulations. We calculate the simulation values of $e_\mathrm{misalign}$ in the right panel by subtracting $e_\mathrm{scalar}$ from the MSE simulation values $\mathcal{E}(\Gamma^*)$.
  • Figure 2: $e_\mathrm{ICL}(C_\mathrm{train},C_\mathrm{test})$ against alignment measures: $e_\mathrm{misalign}(C_\mathrm{train},C_\mathrm{test})$, $\mathop{\mathrm{tr}}\nolimits[C_\mathrm{test} F]$, $\mathop{\mathrm{tr}}\nolimits[C_\mathrm{test}C_\mathrm{train}^{-1}]$, and $1/\mathrm{CKA}(C_\mathrm{train},C_\mathrm{test})$ from left to right. $C_\mathrm{train}$ is fixed to be a diagonal matrix with power-law spectrum $C_\mathrm{train} \propto \mathrm{diag}([1^{-p},\ldots,d^{-p}])$ with power $p=0.9$ and $\mathop{\mathrm{tr}}\nolimits[C_\mathrm{train}]=1$. $C_\mathrm{test}$ is varied over a range of covariance matrices that are simultaneously diagonalizable with $C_\mathrm{train}$, specifically power-law spectra with different powers (circles connected by solid line) and low-rank covariance matrices $C_r = \mathrm{diag}[(d/r)\mathbf{1}_r\,, \mathbf{0}_{d-r}]$ (triangular markers connected by dashed line). Changing the power of the power-law tests or the rank of the low-rank tests makes them either more or less aligned with $C_\mathrm{train}$. Parameters: $d=120$, $\alpha = 2$, $\tau = 4$, $\rho = 0.01$.
  • Figure 3: ICL test loss of a nonlinear transformer against different alignment measures. The setup of the covariances is identical to Figure 2; the only difference is that here the ICL error is computed as the MSE on the test task achieved by a trained two-layer architecture with softmax attention and MLP connections. Our measure $e_\mathrm{misalign}$ achieves the best correlation with ICL error: the Spearman coefficients (measuring monotonicity, over all test covariances and averaged over the different $\kappa$ values) are 0.99 (ours), 0.98, 0.96, and 0.39 from left to right. Parameters: $d=20$, $\alpha = 2$, $\tau = 4$, $\rho = 0.01$.
  • Figure 4: Heatmap of theoretical ICL error given by (eq:iclerrorformula) for simultaneously diagonalizable power-law task covariances $C_\mathrm{train}$ and $C_\mathrm{test}$ with spectral power $p_\mathrm{train}$ (variable) and $p_\mathrm{test}$ (fixed). The $x$-axis shows task diversity $\kappa$ and the $y$-axis shows the difference $p_\mathrm{train}-p_\mathrm{test}$ between task spectral powers. The colorbar shows the % improvement in error by training on $C_\mathrm{train}$ instead of $C_\mathrm{test}$. This shows that increasing spectral power in the pretraining tasks can markedly improve ICL error on the same test distribution. Parameters: $d=100$, $p_\mathrm{test} = 0.9$, $\alpha = 1$, $\tau = 4$, $\rho = 0.01$.
  • Figure 5: Demonstration of opposing eigenvalues for $\tau < 1$ values, at a range of $\kappa$ and $\alpha$ values. $C_\mathrm{train}$ here is the same as in Figure 2, i.e., a power-law spectrum.
  • ...and 2 more figures
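
The two baseline alignment measures named in the Figure 2 caption that depend only on the covariances, $\mathop{\mathrm{tr}}\nolimits[C_\mathrm{test}C_\mathrm{train}^{-1}]$ and $1/\mathrm{CKA}(C_\mathrm{train},C_\mathrm{test})$, can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and CKA is assumed here to be the normalized Frobenius inner product $\mathop{\mathrm{tr}}\nolimits(AB)/(\|A\|_F\|B\|_F)$, which may differ from the definition used in the paper. The measures $e_\mathrm{misalign}$ and $\mathop{\mathrm{tr}}\nolimits[C_\mathrm{test}F]$ are not reproduced because their formulas are not given in this summary.

```python
import numpy as np

def powerlaw_cov(d, p):
    """Diagonal covariance with eigenvalues k^{-p}, k = 1..d, scaled to unit trace."""
    eig = np.arange(1, d + 1, dtype=float) ** (-p)
    return np.diag(eig / eig.sum())

def low_rank_cov(d, r):
    """C_r = diag[(d/r) 1_r, 0_{d-r}], as written in the Figure 2 caption (trace d)."""
    eig = np.zeros(d)
    eig[:r] = d / r
    return np.diag(eig)

def trace_alignment(c_train, c_test):
    """tr[C_test C_train^{-1}], computed with a linear solve rather than an explicit inverse."""
    # solve(c_train, c_test) = C_train^{-1} C_test; its trace equals tr[C_test C_train^{-1}].
    return float(np.trace(np.linalg.solve(c_train, c_test)))

def cka(a, b):
    """ASSUMED definition: tr(AB) / (||A||_F ||B||_F) for symmetric PSD matrices."""
    return float(np.trace(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    d = 120
    c_train = powerlaw_cov(d, p=0.9)  # caption values: d = 120, p = 0.9
    # The test powers below are illustrative choices, not values from the paper.
    for p_test in (0.5, 0.9, 1.5):
        c_test = powerlaw_cov(d, p_test)
        print(p_test, trace_alignment(c_train, c_test), 1.0 / cka(c_train, c_test))
```

Under the assumed definition, CKA is invariant to rescaling either matrix, whereas the trace measure is not, so the normalization of $C_\mathrm{test}$ matters when comparing the two.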

Theorems & Definitions (13)

  • Lemma 1
  • Remark 1
  • Lemma 2
  • proof
  • proof
  • Definition 1
  • Lemma 3
  • Lemma 4
  • proof
  • Remark 2
  • ...and 3 more