Is In-Context Universality Enough? MLPs are Also Universal In-Context
Anastasis Kratsios, Takashi Furuya
TL;DR
The paper asks whether in-context universality suffices to explain transformers' edge over classical models, and answers no: it proves that MLPs with trainable activation functions are also universal in-context over permutation-invariant contexts (PICs). It develops a rigorous non-Euclidean approximation framework based on $1$-Wasserstein representations of PICs, showing that any uniformly continuous PIC-to-distribution map on a compact set of PICs can be uniformly approximated by an MLP, with explicit depth and width guarantees and high-probability error bounds. The construction couples Voronoi-Lusin-type decompositions, piecewise-constant approximations, and exact $1$-Wasserstein implementations via ReLU/ReQU networks. It further shows how to transform such an MLP into a multi-head transformer with the same approximation power (a "transformerification" pathway), which yields a quantitative universal approximation result for transformers on PICs and a quantitative link between MLP and transformer expressivity in the in-context setting. Consequently, the transformer’s success cannot be attributed solely to in-context universality; it is more likely driven by other factors such as inductive biases or training stability.
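To make the notion of a permutation-invariant context concrete, the following minimal sketch treats a context as a uniform empirical measure on the real line and computes the $1$-Wasserstein distance between two such contexts. It is an illustration only, not the paper's construction: the helper name `w1_empirical_1d`, the restriction to one dimension, and the assumption that both contexts have the same number of atoms are choices made here. In that special case the distance reduces to the mean absolute difference of the sorted samples, so reordering a context leaves it unchanged.

```python
import numpy as np


def w1_empirical_1d(xs, ys):
    """1-Wasserstein distance between two uniform empirical measures on R.

    Hypothetical helper for illustration. Assumes both contexts have the
    same number of atoms, in which case the optimal coupling matches the
    sorted samples in order.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes contexts of equal size")
    return float(np.mean(np.abs(xs - ys)))


rng = np.random.default_rng(0)
context_a = rng.normal(size=8)            # one context: 8 tokens in R
context_b = rng.normal(loc=0.5, size=8)   # a second context
shuffled_a = rng.permutation(context_a)   # same tokens, different order

d_original = w1_empirical_1d(context_a, context_b)
d_shuffled = w1_empirical_1d(shuffled_a, context_b)

# The distance depends only on the empirical measure, not on token order.
print(d_original, d_shuffled)
```

Because the distance is a function of the empirical measure alone, any model that takes such a representation as input is permutation-invariant by construction.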
Abstract
The success of transformers is often linked to their ability to perform in-context learning. Recent work shows that transformers are universal in-context, capable of approximating any real-valued continuous function of a context (a probability measure over $\mathcal{X}\subseteq \mathbb{R}^d$) and a query $x\in \mathcal{X}$. This raises the question: Does in-context universality explain their advantage over classical models? We answer this in the negative by proving that MLPs with trainable activation functions are also universal in-context. This suggests the transformer's success is likely due to other factors like inductive bias or training stability.
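Read schematically, the universality property described above can be written as follows. The notation is introduced here for illustration and is not taken from the paper: $\mathcal{K}$ is assumed to be a compact set of contexts in $\mathcal{P}(\mathcal{X})$, $\mathcal{X}$ is assumed compact, and $\mathcal{F}$ stands for the model class (transformers, or MLPs with trainable activations). The paper's precise hypotheses and quantitative rates may differ.

$$
\forall f \in C(\mathcal{K}\times\mathcal{X},\mathbb{R}),\ \forall \varepsilon>0,\ \exists \hat{f}\in\mathcal{F}:\qquad
\sup_{(\mu,x)\in\mathcal{K}\times\mathcal{X}} \bigl|f(\mu,x)-\hat{f}(\mu,x)\bigr| \le \varepsilon .
$$

Under this reading, the abstract's claim is that the statement holds for both choices of $\mathcal{F}$, so the property alone cannot separate the two architectures.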
