How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

Samet Demir, Zafer Dogan

TL;DR

This paper investigates in-context learning (ICL) in pretrained Transformers that include nonlinear MLP heads, under high-dimensional conditions with multiple heterogeneous data sources. It shows that a Transformer with linear attention and a two-layer nonlinear MLP head—trained via a single gradient step on the first layer and fully on the second—becomes asymptotically equivalent, in terms of ICL error, to a finite-degree polynomial predictor, leveraging Gaussian universality and Hermite expansions. The authors demonstrate that nonlinear MLPs significantly enhance ICL on nonlinear tasks and reveal how data mixing, structured covariances, and low target noise define high-quality data sources that enable feature learning. Empirical results across synthetic and real-world distributions, including multilingual sentiment analysis, confirm the theory and illustrate practical implications for designing data mixtures and architectures to optimize ICL. Overall, the work provides a rigorous bridge between neural architecture, data distribution, and ICL performance with concrete guidance for real-world use cases.
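As a rough illustration of why a finite-degree polynomial can stand in for a nonlinear activation under Gaussian inputs, the sketch below expands ReLU in probabilists' Hermite polynomials and measures the truncation error. It is not the paper's equivalent model, whose exact form is not reproduced on this page. The grid-based integration and the helper names (`gaussian_mean`, `hermite_coeffs`, `truncation_error`) are assumptions made for the example.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

# Fine grid on [-10, 10]; the Gaussian mass outside this range is negligible.
Z = np.linspace(-10.0, 10.0, 200001)
DZ = Z[1] - Z[0]
DENSITY = np.exp(-0.5 * Z**2) / math.sqrt(2.0 * math.pi)

def gaussian_mean(values):
    """Approximate E[g(z)] for z ~ N(0, 1) by a Riemann sum over the grid."""
    return float(np.sum(values * DENSITY) * DZ)

def relu(z):
    return np.maximum(z, 0.0)

def hermite_coeffs(act, degree):
    """a_k = E[act(z) He_k(z)] / k!, so that act(z) ~ sum_k a_k He_k(z)."""
    coeffs = np.zeros(degree + 1)
    for k in range(degree + 1):
        unit = np.zeros(k + 1)
        unit[k] = 1.0  # coefficient vector that selects He_k in hermeval
        coeffs[k] = gaussian_mean(act(Z) * He.hermeval(Z, unit)) / math.factorial(k)
    return coeffs

def truncation_error(act, degree):
    """Mean squared error E[(act(z) - degree-p truncation)^2] under N(0, 1)."""
    approx = He.hermeval(Z, hermite_coeffs(act, degree))
    return gaussian_mean((act(Z) - approx) ** 2)

if __name__ == "__main__":
    for p in (1, 2, 4, 5):
        print(p, truncation_error(relu, p))
```

The division by `k!` reflects the normalization $\mathbb{E}[\mathrm{He}_j(z)\mathrm{He}_k(z)] = k!\,\delta_{jk}$ for $z \sim \mathcal{N}(0,1)$. Because the Hermite polynomials are orthogonal under the Gaussian measure, the truncation error cannot increase as the degree grows.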

Abstract

Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.
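The multi-source data model can be sketched as follows, using the notation of the figure captions on this page: a mixing ratio $\rho$, isotropic inputs, and a rank-one spiked task covariance ${\bm{I}}_d + \theta \boldsymbol{\gamma}\boldsymbol{\gamma}^T$ for source 1. Several details are assumptions made only for this example and are not stated on this page: the $1/\sqrt{d}$ scaling of $\boldsymbol{\xi}^T {\bm{x}}$, the reading of $\Delta_s$ as a noise variance, and the function names.

```python
import numpy as np

def sample_spiked_gaussian(rng, n, d, theta=0.0, gamma=None):
    """Draw n vectors from N(0, I_d + theta * gamma gamma^T), with ||gamma||_2 = 1."""
    z = rng.standard_normal((n, d))
    if theta == 0.0 or gamma is None:
        return z
    # Adding sqrt(theta) * g * gamma with g ~ N(0, 1) contributes theta * gamma gamma^T.
    g = rng.standard_normal((n, 1))
    return z + np.sqrt(theta) * g * gamma[None, :]

def sample_context(rng, d, l, rho, theta_task, gamma, noise_var=(0.01, 0.01)):
    """Sample one context of l demonstrations plus one query from a two-source mixture."""
    s = int(rng.random() < rho)            # P(s=1) = rho, P(s=0) = 1 - rho
    x = rng.standard_normal((l + 1, d))    # isotropic inputs for both sources
    theta = theta_task if s == 1 else 0.0  # only source 1 has the spiked task covariance
    xi = sample_spiked_gaussian(rng, 1, d, theta, gamma)[0]
    pre = x @ xi / np.sqrt(d)              # ASSUMPTION: 1/sqrt(d) scaling
    noise = np.sqrt(noise_var[s]) * rng.standard_normal(l + 1)  # ASSUMPTION: Delta_s is a variance
    y = np.maximum(pre, 0.0) + noise       # ReLU target function
    return s, x, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 80
    gamma = np.ones(d) / np.sqrt(d)
    s, x, y = sample_context(rng, d=d, l=d, rho=0.5, theta_task=float(d**2), gamma=gamma)
    print(s, x.shape, y.shape)
```

Sampling the spike as an additive rank-one term avoids forming or factorizing the $d \times d$ covariance matrix.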

Paper Structure

This paper contains 31 sections, 12 theorems, 60 equations, 6 figures.

Key Result

Lemma 4.9

Under Assumption assumption:F_and_w, the entries of ${\bm{F}}$ are i.i.d. $\mathcal{N}(0, 1/\text{Tr}(\text{Cov}(\text{vec}({\bm{H}}_{{\bm{Z}}}))))$, ensuring that the components of ${\bm{F}} \text{vec}({\bm{H}}_{{\bm{Z}}})$ have unit variance. Let ${\bm{f}}_i$ denote the $i$-th column of ... for all $i \in \{1, \dots, k\}$.
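The variance normalization in the lemma can be checked numerically. The sketch below assumes a hypothetical zero-mean Gaussian vector with a known diagonal covariance in place of $\text{vec}({\bm{H}}_{{\bm{Z}}})$, whose definition is not given on this page. The dimensions and the function name `projected_variances` are chosen only for illustration.

```python
import numpy as np

def projected_variances(dim=400, k=50, n_samples=2000, seed=0):
    """Empirical variance of each component of F v when F_ij ~ N(0, 1 / Tr(Cov(v)))."""
    rng = np.random.default_rng(seed)
    # Hypothetical stand-in for vec(H_Z): zero mean, non-identity diagonal covariance.
    cov_diag = rng.uniform(0.5, 2.0, size=dim)
    v = rng.standard_normal((n_samples, dim)) * np.sqrt(cov_diag)
    # Entries of F are i.i.d. Gaussian with variance 1 / Tr(Cov(v)).
    F = rng.standard_normal((k, dim)) / np.sqrt(cov_diag.sum())
    projections = v @ F.T          # shape (n_samples, k)
    return projections.var(axis=0)

if __name__ == "__main__":
    variances = projected_variances()
    print(variances.mean(), variances.min(), variances.max())
```

For a zero-mean ${\bm{v}}$ and a row ${\bm{f}}$ with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries, $\mathbb{E}[({\bm{f}}^T {\bm{v}})^2] = \sigma^2\, \text{Tr}(\text{Cov}({\bm{v}}))$. Choosing $\sigma^2 = 1/\text{Tr}(\text{Cov}({\bm{v}}))$ therefore gives unit variance.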

Figures (6)

  • Figure 1: Effects of sample size, context length, and hidden dimension on the ICL errors for the linear Transformer (\ref{eq:linear_transformer}), the Transformer with a nonlinear MLP (\ref{eq:nonlinear_transformer}), and the equivalent model (\ref{eq:equivalent_polynomial_model}). The number of data sources is $\mathcal{S} = 2$ with equal probability: $\mathbb{P}(s=0) = \mathbb{P}(s=1) = 1/2$. For the input vectors, ${\bm{\mu}}_{x,s} = \mathbf{0}$ and ${\bm{\Sigma}}_{x,s} = {\bm{I}}_d$ for all $s$. For the task vectors, ${\bm{\mu}}_{\xi,s} = \mathbf{0}$ for all $s$ while ${\bm{\Sigma}}_{\xi,0} = {\bm{I}}_d$ and ${\bm{\Sigma}}_{\xi,1} = {\bm{I}}_d + \theta \boldsymbol{\gamma}\boldsymbol{\gamma}^T$ for some $\theta \asymp d^2$ and $\boldsymbol{\gamma} \in \mathbb{R}^d$ with $\|\boldsymbol{\gamma}\|_2 = 1$. The target function $\phi_s$ is ReLU, and the target noise is $\Delta_s = 0.01$ for all $s$. The Transformer is used with two different activation functions: ReLU and tanh. Here, $d=80$, $\eta \asymp d^2$, and $\lambda = 5 \times 10^{-5}$. The degree of the equivalent polynomial model is set to $p=4$. The average over 20 Monte Carlo runs is plotted.
  • Figure 2: Data mixing effects on the ICL error: the performance of the Transformer with ReLU activation (diamonds, \ref{eq:nonlinear_transformer}) and of the equivalent model (lines, \ref{eq:equivalent_polynomial_model}) is shown for different mixing ratios $\rho$. Here, $\rho$ controls the data source mixture used for training, with $\mathbb{P}(s=0) := 1-\rho$ and $\mathbb{P}(s=1) := \rho$, while the ICL error is calculated as an average over data sources (\ref{eq:ICL_error}). Here, $d=80$, $l=d$, $n = k = 0.5d^2$, $\lambda = 5 \times 10^{-5}$, $\eta \asymp d^2$, and the target function $\phi_s$ is ReLU for all $s$. The baseline setting is as follows: for the input vectors, ${\bm{\mu}}_{x,s} = \mathbf{0}$ and ${\bm{\Sigma}}_{x,s} = {\bm{I}}_d$ for all $s$; for the task vectors, ${\bm{\mu}}_{\xi,s} = \mathbf{0}$ and ${\bm{\Sigma}}_{\xi,s} = {\bm{I}}_d$ for all $s$; and the target noise is $\Delta_s = 0.01$ for all $s$. In each subfigure, we modify one data property while keeping the rest fixed and show its effect on the ICL error. In (a), we vary the input covariance of the second data source by using ${\bm{\Sigma}}_{x,1} = {\bm{I}}_d + \theta_{x} \boldsymbol{\gamma}_{x}\boldsymbol{\gamma}_{x}^T$, so that $\|{\bm{\Sigma}}_{x,1}\|$ changes with $\theta_{x}$. In (b), we vary the task covariance of the second data source by using ${\bm{\Sigma}}_{\xi,1} = {\bm{I}}_d + \theta_{\xi} \boldsymbol{\gamma}_{\xi}\boldsymbol{\gamma}_{\xi}^T$, so that $\|{\bm{\Sigma}}_{\xi,1}\|$ changes with $\theta_{\xi}$. In (c), we vary the noise of the second data source $\Delta_1$ while the noise of the first source $\Delta_0$ is fixed at $0.2$. We set the degree of the equivalent polynomial model to $p=5$. The average of 20 Monte Carlo runs is plotted.
  • Figure 3: Feature learning with data mixing on synthetic and real-world data: the effect of different step sizes $\eta$ is illustrated. For the synthetic data scenarios in (a) and (b), we start with the same initial setting as Figure 2. In (a), we modify only the covariance of the input vectors from the second data source, while in (b) we change only the covariance of the task vectors from the second data source. In each case, we add a rank-one structure to the covariance, as in Figure 2. We plot the average of 20 Monte Carlo trials for (a) and (b). For the real-world data scenario in (c), we focus on the effect of feature learning on ICL errors for multilingual sentiment analysis using the Multilingual Amazon Reviews Corpus (Keung et al., 2020). This dataset contains customer reviews (with text and star ratings) in multiple languages, such as English and German. By treating English and German reviews as two distinct data sources, we can vary the mixing ratio across languages, allowing us to evaluate our framework in a more realistic setting. We consider English reviews as source 1 and German reviews as source 0, so a mixing ratio of $\rho = 1$ corresponds to entirely English data, and $\rho=0$ to entirely German data. For labels $\{y_i\}$, the review star ratings are demeaned and scaled to lie in the range $[-1,1]$, making the task regression-like. For inputs $\{{\bm{x}}_i\}$, each review text is embedded using the multilingual text embedding model "multilingual-e5-small" (Wang et al., 2024); the resulting 384-dimensional embeddings are reduced to 64 dimensions via principal component analysis (PCA) and then normalized. We group $l$ input-label pairs (of the same language) together to form a context, which makes the problem compatible with our ICL setting. The remaining settings for (c) are $d=l=64$, $n=k=0.25d^2$, and $\lambda=5\times 10^{-5}$. The degree of the equivalent polynomial model is set to $5$. The mean of 100 Monte Carlo trials is illustrated in (c).
  • Figure 4: Per-source ICL errors in the case of Figure 2(a): impact of varying the input covariance of source 1 on the per-source ICL errors.
  • Figure 5: Per-source ICL errors in the case of Figure 2(b): effect of altering the task covariance of source 1 on the per-source ICL errors.
  • ...and 1 more figure
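The preprocessing described in the Figure 3(c) caption can be sketched as follows, with random arrays standing in for the review embeddings and star ratings. The caption does not specify how the ratings are scaled, how the embeddings are normalized, or whether PCA is fit per language or on pooled data. The choices below (division by the maximum absolute deviation, per-vector L2 normalization, a single PCA fit) and all function names are therefore assumptions.

```python
import numpy as np

def scale_ratings(stars):
    """Demean star ratings, then scale so that the values lie in [-1, 1]."""
    centered = stars - stars.mean()
    return centered / np.abs(centered).max()  # assumes the ratings are not all equal

def pca_reduce(embeddings, out_dim):
    """Project centered embeddings onto their top out_dim principal directions."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm (ASSUMPTION: per-vector normalization)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def make_contexts(x, y, l):
    """Group consecutive input-label pairs from one language into contexts of length l."""
    n_ctx = len(x) // l
    return (x[: n_ctx * l].reshape(n_ctx, l, x.shape[1]),
            y[: n_ctx * l].reshape(n_ctx, l))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.standard_normal((1000, 384))             # stand-in for 384-dim text embeddings
    stars = rng.integers(1, 6, size=1000).astype(float)  # stand-in for 1-5 star ratings
    x = l2_normalize(pca_reduce(emb, 64))
    y = scale_ratings(stars)
    ctx_x, ctx_y = make_contexts(x, y, l=64)
    print(ctx_x.shape, ctx_y.shape)
```

Contexts are built separately for each language, so the mixing ratio $\rho$ can then be applied at the level of whole contexts.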

Theorems & Definitions (24)

  • Lemma 4.9: Asymptotic distribution of ${\bm{F}} \text{vec}({\bm{H}}_{\bm{Z}})$
  • proof
  • Corollary 4.10: Joint distribution of $({\bm{F}}\text{vec}({\bm{H}}_{{\bm{Z}}}), \boldsymbol{\xi}^T {\bm{x}}_{\ell+1})$ conditioned on $\boldsymbol{\xi}$ and $s$
  • Lemma 4.11: Decomposition of the gradient matrix
  • proof
  • Theorem 4.12: Equivalent polynomial model
  • proof
  • Lemma B.1
  • proof
  • Corollary B.2
  • ...and 14 more