Exact Sequence Interpolation with Transformers

Albert Alcalde, Giovanni Fantuzzi, Enrique Zuazua

TL;DR

This work proves that transformers can exactly interpolate datasets of real-valued sequence-to-sequence pairs whose outputs have lengths $m^j$, with complexity independent of the input lengths, via an explicit construction that alternates feed-forward and self-attention layers. Exact interpolation is established for both hardmax and softmax self-attention, with block counts $L = 2\sum_{j=1}^N m^j + 2N + 1$ (hardmax) or $L = 2\sum_{j=1}^N m^j + 3N$ (softmax) and parameter counts $P = \mathcal{O}(d\sum_{j=1}^N m^j)$, using low-rank or identity-like attention matrices. The construction proceeds in four steps: separate overlapping sequences, select leader tokens to represent the outputs, collapse the remaining tokens through clustering, and interpolate onto the target sequences. The hardmax setting permits an explicit, quantitative analysis, which is then extended to the softmax setting. These results help explain why transformers perform well on long-input, short-output tasks and give insight into regularized training: when exact interpolation is achievable, the training loss at a global minimizer is bounded by a quantity that scales linearly with the regularization parameter.
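
To make the block counts quoted in this TL;DR concrete, the following minimal Python sketch evaluates both formulas for a given list of output lengths. It takes the two formulas at face value as stated here; the function name `block_counts` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, Sequence


def block_counts(output_lengths: Sequence[int]) -> Dict[str, int]:
    """Evaluate the block counts quoted in the TL;DR.

    output_lengths holds m^1, ..., m^N, one entry per training sequence.
    Input sequence lengths are deliberately not an argument: the quoted
    counts do not depend on them.
    """
    n = len(output_lengths)          # N, the number of sequences
    total = sum(output_lengths)      # sum over j of m^j
    return {
        "hardmax": 2 * total + 2 * n + 1,
        "softmax": 2 * total + 3 * n,
    }


# Output lengths from the Figure 5 setting: m^1 = 1, m^2 = 1, m^3 = 2.
counts = block_counts([1, 1, 2])
print(counts)
```

For output lengths $(1, 1, 2)$ this gives 15 blocks in the hardmax case and 17 in the softmax case, whatever the lengths of the input sequences.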

Abstract

We prove that transformers can exactly interpolate datasets of finite input sequences in $\mathbb{R}^d$, $d\geq 2$, with corresponding output sequences of smaller or equal length. Specifically, given $N$ sequences of arbitrary but finite lengths in $\mathbb{R}^d$ and output sequences of lengths $m^1, \dots, m^N \in \mathbb{N}$, we construct a transformer with $\mathcal{O}(\sum_{j=1}^N m^j)$ blocks and $\mathcal{O}(d \sum_{j=1}^N m^j)$ parameters that exactly interpolates the dataset. Our construction provides complexity estimates that are independent of the input sequence length, by alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices in the self-attention mechanism, a common feature of practical transformer implementations. These results are first established in the hardmax self-attention setting, where the geometric structure permits an explicit and quantitative analysis, and are then extended to the softmax setting. Finally, we demonstrate the applicability of our exact interpolation construction to learning problems, in particular by providing convergence guarantees to a global minimizer under regularized training strategies. Our analysis contributes to the theoretical understanding of transformer models, offering an explanation for their excellent performance in exact sequence-to-sequence interpolation tasks.
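
The link between exact interpolation and regularized training can be sketched with a standard comparison argument. The notation is an assumption made here for illustration: $\mathcal{L}(\theta) \geq 0$ denotes the training loss, $\theta_{\text{exact}}$ denotes the parameters of an exactly interpolating transformer, so that $\mathcal{L}(\theta_{\text{exact}}) = 0$, and $\theta_\varepsilon$ denotes a global minimizer of a Tikhonov-regularized objective $\mathcal{J}_\varepsilon$. The paper's precise statement may differ.

```latex
\mathcal{J}_\varepsilon(\theta) = \mathcal{L}(\theta) + \varepsilon \|\theta\|_2^2,
\qquad
\mathcal{L}(\theta_\varepsilon)
\;\le\; \mathcal{J}_\varepsilon(\theta_\varepsilon)
\;\le\; \mathcal{J}_\varepsilon(\theta_{\text{exact}})
\;=\; \varepsilon \|\theta_{\text{exact}}\|_2^2 .
```

Under these assumptions, the training loss at a global minimizer is at most $\varepsilon \|\theta_{\text{exact}}\|_2^2$, a bound that is linear in the regularization parameter $\varepsilon$.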

Paper Structure

This paper contains 35 sections, 12 theorems, 75 equations, 8 figures.

Key Result

Theorem 1.4

Fix $N\in \mathbb{N}$, $d\geq 2$ and a dataset of sequences $\{(X^j, Y^j)\}_{j\in [N]}$ satisfying Assumption \ref{ass:dataset}. There exists a hardmax transformer $\mathrm{T}^0$ with feed-forward layers of width $d' \leq 4$, such that

Figures (8)

  • Figure 1: In the left panel, the training loss (log scale) for Example \ref{ex:tikhonov}. The red dotted line corresponds to the Tikhonov threshold $\varepsilon \|\theta_{\text{exact}} \|_2^2$. In the right panel, the minimum of the training loss for different choices of the regularization parameter $\varepsilon$.
  • Figure 2: Geometric interpretation of Equation \ref{eq:hardmaxFormulation} for $i=1$ with $A=I$. Tokens $x_2$ and $x_3$ have the largest orthogonal projection onto $Ax_1 = x_1$, so $\mathcal{C}_i(X,A) = \{2,3\}$.
  • Figure 3: Schematic of the transformer architecture described in Section \ref{ss:theTransformer}.
  • Figure 4: Initial configuration of tokens in the top row, and the asymptotic configuration in the bottom row for $(a)$ Lemma \ref{lem:rank1-asymptotics}, $(b)$ Lemma \ref{lem:fullConc} and $(c)$ Lemma \ref{lem:noConc}.
  • Figure 5: Proof sketch of Theorem \ref{thm:mainResult} applied to $N=3$ input sequences in $\mathbb{R}^2$ (denoted with blue circles, orange triangles and green stars, respectively), with output sequences of length $m^1 = 1$, $m^2 = 1$ and $m^3 = 2$ (denoted with diamonds colored accordingly). The tokens of the initial sequences are shown in panel $(a)$. Panels $(b)$--$(e)$ show the tokens of each sequence after the separation, leader selection, collapse and interpolation steps of the proof. Note that, after the collapse step in panel $(d)$, only tokens selected as leaders are visible. After all steps, sequences match their outputs, as shown in panel $(e)$.
  • ...and 3 more figures
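
The geometric picture in the Figure 2 caption can be sketched numerically. The block below assumes that $\mathcal{C}_i(X,A)$ collects the indices $j$ that maximize $\langle Ax_i, x_j\rangle$, which is how the caption reads; the exact hardmax formulation is not reproduced in this summary. The function name `hardmax_index_set`, the tolerance `tol`, and the four sample tokens are hypothetical and chosen only to mimic the caption.

```python
import numpy as np


def hardmax_index_set(X: np.ndarray, A: np.ndarray, i: int, tol: float = 1e-12) -> list:
    """Return the 0-based indices j that maximize <A x_i, x_j>.

    X has shape (n, d), one token per row; A has shape (d, d).
    Assumption: the maximum runs over all tokens, including x_i itself.
    """
    scores = X @ (A @ X[i])  # scores[j] = <A x_i, x_j>
    return np.flatnonzero(scores >= scores.max() - tol).tolist()


# Four illustrative tokens in R^2 (not taken from the paper).
X = np.array([
    [1.0, 0.0],    # x_1
    [2.0, 1.0],    # x_2
    [2.0, -1.0],   # x_3
    [-1.0, 0.5],   # x_4
])
A = np.eye(2)  # A = I, as in the caption

selected = hardmax_index_set(X, A, i=0)
print([j + 1 for j in selected])  # 1-based indices: [2, 3]
```

With $A = I$ the scores for $i = 1$ are $(1, 2, 2, -1)$, so tokens $x_2$ and $x_3$ tie for the largest projection onto $x_1$ and the sketch recovers the set $\{2, 3\}$ from the caption.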

Theorems & Definitions (25)

  • Example 1.1
  • Theorem 1.4: Exact interpolation with hardmax transformers
  • Remark 1.5
  • Theorem 1.6: Exact interpolation with softmax transformers
  • Proposition 2.1
  • Example 2.2
  • Lemma 4.1
  • proof
  • Lemma 4.2: Full clustering
  • proof
  • ...and 15 more