Transformers are Universal In-context Learners

Takashi Furuya; Maarten V. de Hoop; Gabriel Peyré

TL;DR

The paper develops a rigorous, measure-theoretic framework for transformers operating on arbitrarily long token contexts: contexts are treated as probability distributions over tokens, and smoothness is measured with respect to the Wasserstein distance. It proves universal approximation results for both unmasked and masked (causal) transformers with a fixed embedding dimension and a number of heads that do not grow with the target precision, so that a single transformer can approximate any continuous in-context mapping uniformly over compact token domains. The approach expresses in-context mappings as compositions of context-dependent (attention) and context-free (MLP) operators, then uses dense families of cylindrical functions and Stone–Weierstrass arguments to establish universality, with a space-time lifting to handle causality. The masked-case result additionally requires Lipschitz regularity of the time-evolving context and an identifiability condition, which together guarantee the reduction to a two-argument representation amenable to approximation. Overall, the work provides a general theory of the expressive power of transformers as universal in-context learners operating on distributions, with implications for understanding long-context reasoning and mean-field limits in deep architectures.
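To make the notion of Wasserstein smoothness concrete, here is a minimal sketch of the Wasserstein-1 distance between two contexts viewed as empirical measures. It assumes scalar tokens (the paper works on a compact subset of $\mathbb{R}^d$) and equal-size, equally weighted contexts, in which case the optimal coupling simply matches sorted tokens. The function name `wasserstein1_1d` is hypothetical and chosen for illustration only.

```python
def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two uniform empirical measures on the
    real line with the same number of atoms (illustrative 1D special case)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes contexts of equal size")
    # In 1D with equal weights, the optimal transport plan pairs sorted atoms.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Reordering the tokens does not change the context as a measure.
d_perm = wasserstein1_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])

# Shifting every token by 0.5 moves the context by exactly 0.5.
d_shift = wasserstein1_1d([0.0, 1.0], [0.5, 1.5])
```

Under this metric, a continuous in-context mapping is one whose output changes little when the context tokens are slightly perturbed, regardless of their order.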

Abstract

Transformers are deep architectures that define "in-context mappings" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In this work, we study in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically, uniformly address their expressivity, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens which becomes discrete for a finite number of these. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLPs between multi-head attention layers is also explicitly controlled. We consider both unmasked attentions (as used for the vision transformer) and masked causal attentions (as used for NLP and time series applications). We tackle the causal setting leveraging a space-time lifting to analyze causal attention as a mapping over probability distributions of tokens.
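As an illustration of a context "represented by a probability distribution of tokens which becomes discrete for a finite number of these", the sketch below writes a single softmax attention head as a weighted average under a discrete measure. It is an assumption-laden toy, not the paper's construction: the function name `attention_on_measure` and the plain-list matrix format are hypothetical, and the skip connection, multiple heads, and MLP layers are omitted.

```python
import math


def attention_on_measure(x, support, weights, Q, K, V):
    """One softmax attention head evaluated at query token x, with the context
    given as the discrete measure sum_i weights[i] * delta_{support[i]}."""

    def matvec(M, v):
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]

    q = matvec(Q, x)
    scores = [sum(a * b for a, b in zip(q, matvec(K, y))) for y in support]
    top = max(scores)  # subtract the max score for numerical stability
    unnorm = [w * math.exp(s - top) for w, s in zip(weights, scores)]
    total = sum(unnorm)
    values = [matvec(V, y) for y in support]
    return [
        sum(u * v[j] for u, v in zip(unnorm, values)) / total
        for j in range(len(values[0]))
    ]


identity = [[1.0, 0.0], [0.0, 1.0]]  # toy query, key, and value matrices
tokens = [[0.0, 1.0], [1.0, 0.0], [0.5, -0.5]]
x = [0.3, 0.7]

# Three tokens with uniform weights.
out_n = attention_on_measure(x, tokens, [1 / 3] * 3, identity, identity, identity)

# The same tokens repeated twice define the same probability measure.
out_2n = attention_on_measure(x, tokens * 2, [1 / 6] * 6, identity, identity, identity)
```

Both calls return the same vector up to floating-point rounding, because the output depends on the context only through the underlying measure and not on the number or order of tokens. This is the property that lets one set of parameters act on contexts of any length.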

Paper Structure (30 sections, 17 theorems, 150 equations)

Introduction
Universality, from neural networks to neural operators.
Mathematical modeling of transformers.
Universality of transformers.
Our contributions
Notation
Measure-theoretic in-context mappings
Attention as in context mappings on token ensembles
Classical definition.
In-context mappings form.
Composition of in-context mappings.
Measure-theoretic in-context mappings: Unmasked setting
Composition of in-context unmasked measure-theoretic mappings.
Measure-theoretic in-context mappings: Masked setting
Composition of in-context masked measure-theoretic mappings.
...and 15 more sections

Key Result

Theorem 1

Let $\Omega \subset \mathbb{R}^d$ be a compact set and $\Lambda^\star : \mathcal{P}(\Omega) \times \Omega \rightarrow \mathbb{R}^{d'}$ be continuous, where $\mathcal{P}(\Omega)$ is endowed with the weak$^*$ topology. Then for all $\varepsilon>0$, there exist $L$ and parameters $(\theta_\ell,\xi_\ell)_{\ell=1}^{L}$ with $d_{\mathrm{in}}(\theta_\ell) \leq d+3d'$, $d_{\mathrm{head}}(\theta_\ell) = k(\theta_\ell) = \dots$

Theorems & Definitions (35)

Theorem 1
Proposition 1
proof
Lemma 1
Lemma 2
Lemma 3
Definition 1: Lipschitz contexts
Definition 2: Masked measure
Definition 3: Causal identifiable map
Theorem 2
...and 25 more
