Lower bounds on transformers with infinite precision

Alexander Kozachinskiy

TL;DR

The paper proves lower bounds for 1-layer softmax transformers in the infinite-precision setting for two tasks: $\mathrm{Comp}_n$ and $\mathrm{Sum}_2^{n,n}$. It introduces a VC-dimension-based method that upper-bounds the complexity of the induced hypothesis class by counting the parameters and arithmetic operations of the ReLU output MLP, and uses this to derive a contradiction when both the embedding dimension $d$ and the output MLP size are $n^{o(1)}$. As a consequence, any 1-layer softmax transformer for these tasks must have either embedding dimension or output MLP size at least $n^{\Omega(1)}$. The author also discusses palindrome recognition, noting that it can be solved with constant resources under infinite precision but exhibits different VC-dimension behavior under limited precision; the method connects to prior communication-complexity lower bounds for related problems. The work provides a new theoretical tool for understanding transformer capabilities and the limits of softmax attention in the infinite-precision regime.

Abstract

In this note, we use the VC dimension technique to prove the first lower bound against one-layer softmax transformers with infinite precision. We do so for two tasks: function composition, considered by Peng, Narayanan, and Papadimitriou, and the SUM$_2$ task, considered by Sanford, Hsu, and Telgarsky.
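To make the first task concrete, here is a minimal sketch of function composition as a plain lookup problem. The encoding is an assumption made for illustration (two functions on $\{0, \dots, n-1\}$ stored as Python lists, plus a query point); the paper's exact input format for $\mathrm{Comp}_n$ may differ, and the name `compose` is hypothetical.

```python
from typing import Sequence


def compose(f: Sequence[int], g: Sequence[int], x: int) -> int:
    """Return f(g(x)) for functions on {0, ..., n-1} stored as lookup tables.

    Assumption: f[i] and g[i] hold the function values at i. This is an
    illustrative encoding, not necessarily the one used in the paper.
    """
    n = len(g)
    if len(f) != n:
        raise ValueError("f and g must be defined on the same domain")
    if not 0 <= x < n:
        raise ValueError("query point is outside the domain")
    inner = g[x]
    if not 0 <= inner < n:
        raise ValueError("g maps the query outside the domain of f")
    return f[inner]


if __name__ == "__main__":
    g = [2, 0, 1]
    f = [1, 1, 0]
    # g(0) = 2 and f(2) = 0, so the composition at 0 is 0.
    print(compose(f, g, 0))
```

The task is trivial for a program with random access to the tables; the lower bound concerns how much embedding dimension or output MLP size a 1-layer softmax transformer needs to do the same lookup.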

Paper Structure

This paper contains 4 sections, 3 theorems, 17 equations.

Key Result

Theorem 1

There is no 1-layer single-token output transformer with embedding dimension $n^{o(1)}$ and output MLP with $n^{o(1)}$ ReLU neurons that computes $\mathrm{Comp}_n$.
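As a reading aid (not part of the theorem's statement), the $n^{o(1)}$ notation can be unpacked as follows. Here $d(n)$ is an assumed name for the embedding dimension viewed as a function of $n$, with $d(n) \ge 1$:

$$
d(n) = n^{o(1)} \quad\Longleftrightarrow\quad \lim_{n \to \infty} \frac{\log d(n)}{\log n} = 0 .
$$

For example, $d(n) = (\log n)^{10}$ is $n^{o(1)}$, since $\frac{10 \log\log n}{\log n} \to 0$, while $d(n) = \sqrt{n}$ is not, since the ratio is constantly $\tfrac{1}{2}$. The same reading applies to the number of ReLU neurons in the output MLP.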

Theorems & Definitions (7)

  • Definition 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof (proof sketch)