Lower bounds on transformers with infinite precision

Alexander Kozachinskiy

TL;DR

The paper proves lower bounds for 1-layer softmax transformers in the infinite-precision setting, for two tasks: $\mathrm{Comp}_n$ and $\mathrm{Sum}_2^{n,n}$. It introduces a VC-dimension-based method that upper-bounds the complexity of the induced hypothesis class by counting the parameters and arithmetic operations of the ReLU output MLP, and derives a contradiction when both the embedding dimension $d$ and the output MLP size are $n^{o(1)}$. Consequently, any 1-layer softmax transformer for these tasks must have either embedding dimension or output MLP size at least $n^{\Omega(1)}$. The author also discusses palindrome recognition, noting that it can be solved with constant resources under infinite precision but exhibits different VC-dimension behavior under limited precision; the method connects to prior communication-complexity lower bounds for related problems. This work provides a new theoretical tool for understanding transformer capabilities and the limits of softmax attention in the infinite-precision regime.
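The shape of the argument summarized above can be written schematically as follows. This is only a sketch: the symbols $\mathcal{H}_{d,s}$ (the induced hypothesis class), $s$ (output MLP size), and $c$ (some positive constant), as well as the polynomial form of the upper bound, are assumptions introduced here for illustration and are not quoted from the paper.

$$
\mathrm{VCdim}(\mathcal{H}_{d,s}) \le \mathrm{poly}(d, s)
\quad\text{and}\quad
\mathrm{VCdim}(\mathcal{H}_{d,s}) \ge n^{c}
\;\Longrightarrow\;
\max(d, s) \ge n^{\Omega(1)} .
$$

Under these assumptions, taking $d = n^{o(1)}$ and $s = n^{o(1)}$ would make the upper bound $n^{o(1)}$, which cannot exceed $n^{c}$ for large $n$; this is the kind of contradiction the TL;DR refers to.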

Abstract

In this note, we use the VC dimension technique to prove the first lower bound against one-layer softmax transformers with infinite precision. We do so for two tasks: function composition, considered by Peng, Narayanan, and Papadimitriou, and the SUM$_2$ task, considered by Sanford, Hsu, and Telgarsky.

Paper Structure (4 sections, 3 theorems, 17 equations)

This paper contains 4 sections, 3 theorems, 17 equations.

Introduction
The model
Proofs
Acknowledgment

Key Result

Theorem 1

There is no 1-layer single-token output transformer with embedding dimension $n^{o(1)}$ and output MLP with $n^{o(1)}$ ReLU neurons that computes $\mathrm{Comp}_n$.
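To make the objects in Theorem 1 concrete, here is a minimal sketch of a 1-layer, single-head softmax transformer whose output is read from the last token and passed through a one-hidden-layer ReLU MLP. The function and parameter names (`one_layer_transformer`, `Wq`, `Wk`, `Wv`, `W1`, `b1`, `w2`, `b2`) are hypothetical, and the architecture details (no residual connection, no positional encoding, a single head, scalar output) are simplifying assumptions; the paper's own definition is given in its "The model" section. Python floats are finite precision, so this only illustrates the structure, not the infinite-precision setting.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in M]


def one_layer_transformer(X, Wq, Wk, Wv, W1, b1, w2, b2):
    """Hypothetical single-token-output model.

    X  : list of n token embeddings, each of dimension d (d = embedding dimension)
    W1 : hidden-layer weights of the output MLP; len(W1) = number of ReLU neurons
    Returns a single scalar computed from the last token's attention output.
    """
    q = matvec(Wq, X[-1])                      # query from the last token only
    keys = [matvec(Wk, x) for x in X]
    values = [matvec(Wv, x) for x in X]
    weights = softmax([dot(q, k) for k in keys])
    dim = len(values[0])
    # Softmax-weighted average of the value vectors.
    attended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    # Output MLP: one hidden ReLU layer followed by a linear readout.
    hidden = [max(0.0, h + b) for h, b in zip(matvec(W1, attended), b1)]
    return dot(w2, hidden) + b2


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    X = [[1.0, 0.0], [0.0, 1.0]]
    # With a zero query matrix, attention is uniform over the two tokens.
    print(one_layer_transformer(X, zeros, identity, identity,
                                identity, [0.0, 0.0], [1.0, 1.0], 0.0))
```

In this sketch, the two resources bounded in Theorem 1 correspond to `len(X[0])` (embedding dimension) and `len(W1)` (number of ReLU neurons in the output MLP).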

Theorems & Definitions (7)

Definition 1
Theorem 1
proof
Theorem 2
proof
Theorem 3
proof (Proof sketch)
