Measure-to-measure interpolation using Transformers

Borjan Geshkovski, Philippe Rigollet, Domènec Ruiz-Balet

TL;DR

This work treats Transformers as measure-to-measure flows on the unit sphere, formalizing token sequences as probability measures evolving under a nonlinear continuity equation. It provides a constructive, explicit parameter scheme, piecewise constant in time, that enables a single Transformer to approximately map $N$ input measures to $N$ target measures, under transport-compatibility assumptions. The key steps are disentangling overlapping supports, clustering inputs into discrete atoms, and then matching these atoms to targets via neural-ODE flows, with rigorous bounds on the number of parameter switches and on the resulting approximation error in Wasserstein distance. The results clarify the expressive power of attention-based models for arbitrary input measures and offer a principled protocol for measure transport using deep architectures, with detailed complexity estimates for the required controls.
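To make the Wasserstein error criterion concrete, the following is a minimal sketch, not taken from the paper, of how $\mathsf{W}_2$ could be evaluated between two uniform empirical measures with the same number of atoms on the sphere. It assumes the geodesic (arc-length) distance as ground metric, which may differ from the paper's convention, and it brute-forces the optimal matching over permutations, so it is only suitable for a handful of atoms. The names `geodesic` and `w2_uniform` are hypothetical.

```python
import itertools
import math


def geodesic(x, y):
    """Arc-length distance between two unit vectors (assumed ground metric)."""
    cosine = sum(a * b for a, b in zip(x, y))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    return math.acos(max(-1.0, min(1.0, cosine)))


def w2_uniform(xs, ys, dist=geodesic):
    """W_2 between uniform empirical measures with equally many atoms.

    For uniform weights and equal atom counts, an optimal coupling can be
    taken to be a permutation, so we minimise over all permutations.
    Cost is O(n!), so this is for illustration with small n only.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("both measures need the same number of atoms")
    best = min(
        sum(dist(xs[i], ys[perm[i]]) ** 2 for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return math.sqrt(best / n)


e1, e2, e3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
same_atoms = w2_uniform([e1, e2], [e2, e1])      # same measure, atoms reordered
one_moved = w2_uniform([e1, e2], [e3, e2])       # only one atom differs
```

In this toy example `same_atoms` is zero because the two lists describe the same measure, while `one_moved` equals $\pi/(2\sqrt{2})$: one atom travels a quarter circle and carries half of the mass.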

Abstract

Transformers are deep neural network architectures that underpin the recent successes of large language models. Unlike more classical architectures that can be viewed as point-to-point maps, a Transformer acts as a measure-to-measure map implemented as a specific interacting particle system on the unit sphere: the input is the empirical measure of tokens in a prompt and its evolution is governed by the continuity equation. In fact, Transformers are not limited to empirical measures and can in principle process any input measure. As the nature of data processed by Transformers is expanding rapidly, it is important to investigate their expressive power as maps from an arbitrary measure to another arbitrary measure. To that end, we provide an explicit choice of parameters that allows a single Transformer to match $N$ arbitrary input measures to $N$ arbitrary target measures, under the minimal assumption that every pair of input-target measures can be matched by some transport map.
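As an illustration of the interacting-particle viewpoint, here is a minimal sketch of self-attention dynamics on the unit sphere for an empirical measure of tokens, integrated with explicit Euler steps under piecewise-constant parameters. The vector field used (a softmax-weighted average of $Vx_j$ with scores $\langle Bx_i, x_j\rangle$, projected onto the tangent space at $x_i$) is a commonly studied simplified model and is an assumption here; the paper's exact equation and parameter set are not reproduced in this summary. All function names are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [c / norm for c in x]


def matvec(A, x):
    return [dot(row, x) for row in A]


def attention_velocity(tokens, B, V):
    """Assumed model: softmax average of V x_j, projected tangentially at x_i."""
    values = [matvec(V, y) for y in tokens]
    velocities = []
    for x in tokens:
        Bx = matvec(B, x)
        scores = [dot(Bx, y) for y in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        avg = [
            sum(w * v[k] for w, v in zip(weights, values)) / total
            for k in range(len(x))
        ]
        radial = dot(x, avg)  # component along x, removed by the projection
        velocities.append([a - radial * c for a, c in zip(avg, x)])
    return velocities


def flow(tokens, schedule, dt=0.01):
    """Euler steps with renormalisation; schedule is a list of (duration, B, V)."""
    tokens = [normalize(x) for x in tokens]
    for duration, B, V in schedule:  # parameters are constant on each piece
        for _ in range(round(duration / dt)):
            vels = attention_velocity(tokens, B, V)
            tokens = [
                normalize([c + dt * u for c, u in zip(x, vel)])
                for x, vel in zip(tokens, vels)
            ]
    return tokens


def min_inner_product(tokens):
    return min(dot(x, y) for i, x in enumerate(tokens) for y in tokens[i + 1:])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sharper = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
tokens0 = [normalize(x) for x in ([1.0, 0.0, 0.2], [0.0, 1.0, 0.2], [0.3, 0.3, 1.0])]
schedule = [(2.5, identity, identity), (2.5, sharper, identity)]  # one switch
final = flow(tokens0, schedule)
```

In this toy run the value matrix is the identity and the tokens start in a common hemisphere, so they drift toward each other; comparing `min_inner_product(tokens0)` with `min_inner_product(final)` shows the cluster tightening. The single switch in `schedule` stands in for the piecewise-constant-in-time parameters described in the text, not for the paper's actual construction.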

Paper Structure

This paper contains 30 sections, 19 theorems, 191 equations, 6 figures.

Key Result

Theorem 1.1

Suppose $d\geqslant3$. Consider data (eq: data) such that […]. Then for any $T>0$ and $\varepsilon>0$, there exists $\theta\in L^\infty((0,T);\Uptheta)$ such that for any $i\in\llbracket1,N\rrbracket$, the unique solution $\mu^i\in\mathscr{C}^0([0,T];\mathscr{P}(\mathbb{S}^{d-1}))$ to (eq: cauchy.pb) with data $\mu_0^i$ and parameters $\theta$ satisfies […]. Moreover, $\theta$ can be chosen piecewise constant.

Figures (6)

  • Figure 1: High-level overview of the proof of the main result (Theorem 1.1).
  • Figure 2: Partitioning $\mathscr{C}^i\coloneqq\mathrm{supp}\,\mu^i_0$ into $M$ pieces with connected interiors.
  • Figure 3: Step 2: packing the piece $\mathscr{C}_k^i$ of the partition of $\mathscr{C}^i=\mathrm{supp}\,\mu_0^i$ with balls whose union has mass $\mu_0^i(\mathscr{C}_k^i)-\delta$. A single anchor point $x_k^i$ lies in this piece. The goal of Step 3 is to repeatedly use the tubular mass movement lemma to transfer the mass of each ball to the one highlighted in blue.
  • Figure 4: High-level overview of the proof of the separation proposition.
  • Figure 5: The geometric configuration of Step 1.
  • ...and 1 more figure

Theorems & Definitions (47)

  • Theorem 1.1
  • Theorem 1.2
  • Remark 1.3: Beyond $\mathsf{W}_2$
  • Proposition 2.1
  • proof
  • Proposition 2.2
  • proof
  • Remark 2.3
  • Proposition 3.1
  • Lemma 3.2
  • ...and 37 more