Is In-Context Universality Enough? MLPs are Also Universal In-Context

Anastasis Kratsios, Takashi Furuya

TL;DR

The paper addresses whether in-context universality suffices to explain transformers' edge over classical models by proving that MLPs with trainable activation functions are also universal in-context over permutation-invariant contexts (PICs). It develops a rigorous non-Euclidean approximation framework based on $1$-Wasserstein (PIC) representations, showing that any uniformly continuous PIC-to-distribution map on a compact PIC set can be uniformly approximated by an MLP with explicit depth/width guarantees, with high-probability error bounds. The construction couples Voronoi-Lusin-type decompositions, piecewise-constant approximations, and exact Wasserstein-1 implementations via ReLU/ReQU networks, and it further shows how to transform such an MLP into a multi-head transformer with the same approximation power. Consequently, the transformer’s success cannot be attributed solely to in-context universality; it is likely driven by inductive biases or training stability, with the results further implying a transformer-equivalent quantitative universal approximation for PICs. The work also provides a transformerification pathway, establishing a quantitative link between MLP and transformer expressivity in the in-context setting and highlighting the role of training dynamics and architectural biases in practical performance.

Abstract

The success of transformers is often linked to their ability to perform in-context learning. Recent work shows that transformers are universal in context, capable of approximating any real-valued continuous function of a context (a probability measure over $\mathcal{X}\subseteq \mathbb{R}^d$) and a query $x\in \mathcal{X}$. This raises the question: Does in-context universality explain their advantage over classical models? We answer this in the negative by proving that MLPs with trainable activation functions are also universal in-context. This suggests the transformer's success is likely due to other factors like inductive bias or training stability.

Paper Structure

This paper contains 40 sections, 19 theorems, 83 equations, 5 figures.

Key Result

Proposition 2

Let $N,d\in \mathbb{N}_+$ and let $\mathcal{X}\subseteq \mathbb{R}^d$ be non-empty. Then, there are absolute constants $0<c\le C$ such that: for each $[X],[Y]\in \mathcal{P}_N^{N,d}(\mathcal{X})$ we have […] In particular, the map $\Phi:(\mathcal{P}_{N,N}(\mathcal{X}),\mathcal{W}_1)\to (\operatorname{Mat}^{d,N}_N/\sim,\operatorname{dist})$ is a homeomorphism. Furthermore, $\mathcal{W}$ metrizes the nat…
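
To make the two metrics in Proposition 2 concrete, here is a minimal Python sketch, not the paper's construction. It relies on a standard fact: for two uniform empirical measures with the same number $N$ of atoms, an optimal coupling can be taken to be a permutation, so $\mathcal{W}_1$ reduces to an assignment problem. The function names `w1_uniform` and `quotient_dist` are hypothetical, and `quotient_dist` uses a permutation-minimized Frobenius norm as an assumed stand-in for $\operatorname{dist}$, whose definition is not reproduced in this summary. With this stand-in the comparison constants depend on $N$, unlike the absolute constants in the proposition.

```python
import itertools
import math


def w1_uniform(X, Y):
    """1-Wasserstein distance between the uniform empirical measures on the
    points in X and in Y (equal length). Brute force over all permutations,
    so this is only meant for small N."""
    n = len(X)
    assert len(Y) == n, "both contexts need the same number of tokens"
    best = min(
        sum(math.dist(X[i], Y[s[i]]) for i in range(n))
        for s in itertools.permutations(range(n))
    )
    return best / n


def quotient_dist(X, Y):
    """ASSUMED stand-in for the quotient metric: the Frobenius norm of X - Y,
    minimized over all reorderings of the tokens of Y."""
    n = len(X)
    assert len(Y) == n
    return min(
        math.sqrt(sum(math.dist(X[i], Y[s[i]]) ** 2 for i in range(n)))
        for s in itertools.permutations(range(n))
    )


if __name__ == "__main__":
    # Two contexts with N = 3 tokens in R^2 (illustrative values only).
    X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    Y = [(0.5, 0.5), (1.0, 1.0), (0.0, 0.0)]
    X_shuffled = [X[2], X[0], X[1]]

    print("W1(X, Y)          =", w1_uniform(X, Y))
    print("W1(shuffled X, Y) =", w1_uniform(X_shuffled, Y))
    print("quotient dist     =", quotient_dist(X, Y))
```

Reordering the tokens of a context leaves both quantities unchanged, which is the permutation invariance the quotient $\operatorname{Mat}^{d,N}_N/\sim$ encodes. For this particular stand-in, comparing the $\ell^1$ and $\ell^2$ norms of the per-token distances gives $\tfrac{1}{N}\,\texttt{quotient\_dist} \le \mathcal{W}_1 \le \tfrac{1}{\sqrt{N}}\,\texttt{quotient\_dist}$.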

Figures (5)

  • Figure 1: As Measure in $\mathcal{P}_{7,6}(\mathbb{R}^2)$
  • Figure 2: Context as Tokens.
  • Figure 3: As Matrix in $\operatorname{Mat}_{7}^{2,6}/\sim$
  • Figure 5: The activation function in \ref{eq:activation}.
  • Figure 6: Our Regular Decomposition of the PIC Space $\mathcal{K}$: The Retracted Voronoi cells (non-reddish coloured regions) $C^{\delta_{\star}}_1,\dots,C^{\delta_{\star}}_K$, for $\delta_{\star}>0$, whose union makes up the "large" approximation region $\mathcal{K}\setminus \mathcal{K}^{\delta_{\star}}$. The reddish region symbolizes our "small" trifling region whereon a uniform approximation may fail.
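
The decomposition described in the Figure 6 caption can be illustrated with a small Python sketch, under loudly stated assumptions. Contexts are taken to be $N$ real numbers (so $d=1$), where $\mathcal{W}_1$ between two uniform empirical measures with equally many atoms is the mean absolute difference of the sorted samples. The names `w1_1d` and `make_piecewise_constant` are hypothetical. The rule used to flag the trifling region (nearest and second-nearest centers closer than `delta` apart in distance) is an illustrative stand-in and not the paper's definition of the retracted Voronoi cells.

```python
def w1_1d(xs, ys):
    """W1 between uniform empirical measures on equally many real samples:
    mean absolute difference after sorting both samples."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


def make_piecewise_constant(centers, values, delta):
    """Build an approximator that is constant on each Voronoi cell of
    `centers` (measured in W1) and takes the value `values[k]` on cell k.

    ASSUMPTION: a context is flagged as 'trifling' when its nearest and
    second-nearest centers differ in distance by less than `delta`."""
    assert len(centers) == len(values) >= 2

    def approx(ctx):
        dists = sorted((w1_1d(ctx, c), k) for k, c in enumerate(centers))
        (d0, k0), (d1, _) = dists[0], dists[1]
        in_trifling = (d1 - d0) < delta
        return values[k0], in_trifling

    return approx


if __name__ == "__main__":
    # Toy target: the mean of the context, which is 1-Lipschitz in W1.
    target = lambda ctx: sum(ctx) / len(ctx)
    centers = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
    values = [target(c) for c in centers]
    approx = make_piecewise_constant(centers, values, delta=0.2)

    print(approx([0.1, 0.0, 0.2]))  # well inside the cell of the first center
    print(approx([0.5, 0.5, 0.5]))  # equidistant from both centers
```

Because the toy target is 1-Lipschitz with respect to $\mathcal{W}_1$, the error at any context is at most its distance to the nearest center. Contexts near a cell boundary are the ones this sketch flags, loosely mirroring the "small" region where the caption says a uniform approximation may fail.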

Theorems & Definitions (25)

  • Definition 1: Permutation-Invariant Context
  • Proposition 2: Equivalence of $\mathcal{W}$ and the Natural Quotient Metric on PICs
  • Definition 3: $q$-Dimensional PIC
  • Theorem 4: MLPs are Universal Approximators for PICs
  • Corollary 5: Multihead Transformers version of Theorem \ref{thrm:Main__SimpleVersion}
  • Lemma 6: The Trifling Region is Small
  • Lemma 7: $\delta_{\ast}$-Separated Almost Partition of $\mathcal{K}$
  • Lemma 8: Optimal Piecewise Constant Approximator
  • Lemma 9: Piecewise Constant Partition of Unity on $\mathcal{K}\setminus\mathcal{K}^{\delta_{\ast}}$
  • Lemma 10
  • ...and 15 more