Concise One-Layer Transformers Can Do Function Evaluation (Sometimes)

Lena Strobl, Dana Angluin, Robert Frank

TL;DR

This work analyzes the expressive power of concise transformer architectures on function evaluation for functions $[n]\to[n]$. It shows that 1-layer leftmost-hard-attention transformers can perform function evaluation under several input presentations with $c$-size polylogarithmic in $n$. The hard case, in which the keys are permuted and each key and its value occupy different, consecutive positions, requires $c$-size $\Omega(n\log n)$ with hard attention; a 1-layer softmax-attention construction achieves $O(n\log n)$, and a 2-layer variant also suffices. The paper gives explicit constructions with lower and upper bounds, together with VC-dimension and probabilistic analyses, to delineate the theoretical limits. Experiments show a rough alignment between the proven capabilities and learnability: the concisely representable cases are learnable with small networks and learned positional embeddings, whereas the hardest case often requires deeper models, which suggests a link between conciseness and trainability.

Abstract

While transformers have proven enormously successful in a range of tasks, their fundamental properties as models of computation are not well understood. This paper contributes to the study of the expressive capacity of transformers, focusing on their ability to perform the fundamental computational task of evaluating an arbitrary function from $[n]$ to $[n]$ at a given argument. We prove that concise 1-layer transformers (i.e., with a polylog bound on the product of the number of heads, the embedding dimension, and precision) are capable of doing this task under some representations of the input, but not when the function's inputs and values are only encoded in different input positions. Concise 2-layer transformers can perform the task even with the more difficult input representation. Experimentally, we find a rough alignment between what we have proven can be computed by concise transformers and what can be practically learned.

Paper Structure

This paper contains 35 sections, 14 theorems, 38 equations, 2 figures, 2 tables.

Key Result

Theorem 3.1

For any positive integer $n$, there exists a 1-layer transformer that performs function evaluation for domain $[n]$ when the key $i$ and value $f(i)$ are stored in the same position and keys are permuted (case 3). The transformer uses leftmost hard attention, ... (Footnote: It seems likely that the hard attention t...)
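
As a rough illustration of the setting in Theorem 3.1 (case 3), the following Python sketch simulates a single leftmost-hard-attention lookup over permuted positions that each hold a key and its value. It is not the paper's construction: the +1/-1 binary key encoding, the helper names `pm_bits`, `leftmost_hard_attention`, and `evaluate`, and the example function `f` are all assumptions chosen for illustration.

```python
def pm_bits(k, width):
    """Encode integer k as a list of +1/-1 entries, one per binary digit (assumed encoding)."""
    return [1 if (k >> b) & 1 else -1 for b in range(width)]


def leftmost_hard_attention(scores):
    """Return the index of the leftmost position with the maximal score."""
    return scores.index(max(scores))


def evaluate(pairs, query, n):
    """Look up f(query) from (key, value) pairs stored in arbitrary order.

    Each position holds both the key i and the value f(i). The attention
    score of a position is the dot product between the encoded query and
    the encoded key; it equals `width` exactly when key == query.
    """
    width = max(1, (n - 1).bit_length())  # about log2(n) coordinates
    q = pm_bits(query, width)
    scores = [
        sum(a * b for a, b in zip(q, pm_bits(key, width)))
        for key, _ in pairs
    ]
    return pairs[leftmost_hard_attention(scores)][1]


# Hypothetical example: a function f on [8] = {0, ..., 7}, keys permuted.
f = [3, 0, 7, 7, 1, 5, 2, 4]
order = [5, 2, 7, 0, 3, 6, 1, 4]
pairs = [(i, f[i]) for i in order]
results = [evaluate(pairs, q, 8) for q in range(8)]
```

In this toy version the key encoding uses only about $\log_2 n$ coordinates, which gives some intuition for why a concise solution is plausible when key and value share a position. The sketch says nothing about the precision or head count used in the actual proof.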

Figures (2)

  • Figure 1: Each integer $k \in [8]$ is mapped to a point on the unit circle using the transformation $\mathop{\mathrm{cs}}\nolimits_8(k) = [\cos(\frac{2\pi k}{8}), \sin(\frac{2\pi k}{8})]$. E.g., $k=0$ maps to $(1,0)$, $k=2$ maps to $(0,1)$, and so on. Dashed radial lines show the angles between consecutive points.
  • Figure 2: Each violet violin shows the distribution of accuracies obtained; black dots mark individual runs, and green horizontal lines mark the mean accuracy. The results show a bimodal distribution for certain dimensions.
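
The mapping in the Figure 1 caption can be reproduced with a few lines of Python. This is a minimal sketch: the function names `cs` and `dot` are chosen here for illustration, and the final check (that the dot product of two embedded points is largest exactly when the integers match) is a standard property of points on the unit circle, not a statement taken from the paper.

```python
import math


def cs(n, k):
    """Map integer k in [n] to the point (cos(2*pi*k/n), sin(2*pi*k/n))."""
    angle = 2 * math.pi * k / n
    return (math.cos(angle), math.sin(angle))


def dot(u, v):
    """Dot product of two 2-dimensional points."""
    return u[0] * v[0] + u[1] * v[1]


n = 8
points = [cs(n, k) for k in range(n)]

# For each i, find the j whose embedding has the largest dot product with cs(n, i).
# Since dot(cs(n, i), cs(n, j)) = cos(2*pi*(i - j)/n), the maximum is at j == i.
best_match = [
    max(range(n), key=lambda j: dot(points[i], points[j]))
    for i in range(n)
]

# Gap between a matching pair (score 1) and the nearest non-matching pair.
gap = 1.0 - math.cos(2 * math.pi / n)
```

The gap between the matching score and the next-best score shrinks as $n$ grows, so distinguishing neighbouring points requires more numerical precision for larger $n$.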

Theorems & Definitions (24)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • Theorem 3.3
  • Lemma 3.4
  • proof
  • Lemma 3.5
  • Theorem 3.6
  • Theorem 3.7
  • proof
  • ...and 14 more