Lower bounds on transformers with infinite precision
Alexander Kozachinskiy
TL;DR
The paper proves lower bounds for 1-layer softmax transformers in the infinite-precision setting, for two tasks: $Comp_n$ and $Sum_2^{n,n}$. It introduces a VC-dimension-based method that upper-bounds the complexity of the induced hypothesis class by counting the parameters and arithmetic operations of the ReLU output MLP, and derives a contradiction when both the embedding dimension $d$ and the size of the output MLP are $n^{o(1)}$. As a consequence, any 1-layer softmax transformer for these tasks must have either embedding dimension or output MLP size at least $n^{\Omega(1)}$. The author also discusses palindrome recognition, noting that it can be solved with constant resources under infinite precision but exhibits different VC-dimension behavior under limited precision; the method connects to prior communication-complexity lower bounds for related problems. This work provides a new theoretical tool for understanding transformer capabilities and the limits of softmax attention in the infinite-precision regime.
Abstract
In this note, we use the VC dimension technique to prove the first lower bound against one-layer softmax transformers with infinite precision. We do so for two tasks: function composition, considered by Peng, Narayanan, and Papadimitriou, and the SUM$_2$ task, considered by Sanford, Hsu, and Telgarsky.
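To make the model class concrete, here is a minimal Python sketch of a one-layer softmax transformer whose output is read from the last position and passed through a ReLU MLP. The exact architecture used in the note is not stated in this section, so everything here is an assumption made for illustration: the residual connection, the names `one_layer_transformer`, `relu_mlp`, `Q`, `K`, `V`, and the choice to read the answer from the last token. The two resources the lower bound refers to correspond to the embedding dimension `d` (the length of each token vector) and the size of the `layers` list passed to the MLP.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relu_mlp(layers, v):
    # layers: list of (W, b) pairs; ReLU after every layer except the last.
    for idx, (W, b) in enumerate(layers):
        v = [z + c for z, c in zip(matvec(W, v), b)]
        if idx < len(layers) - 1:
            v = [max(0.0, z) for z in v]
    return v

def one_layer_transformer(xs, Q, K, V, layers):
    # xs: n embedded tokens, each a vector in R^d (hypothetical setup).
    # The last token attends to all positions via softmax attention.
    q = matvec(Q, xs[-1])
    weights = softmax([dot(q, matvec(K, x)) for x in xs])
    values = [matvec(V, x) for x in xs]
    d = len(xs[0])
    attended = [sum(w * val[j] for w, val in zip(weights, values))
                for j in range(d)]
    # Assumed residual connection, then the ReLU output MLP.
    hidden = [a + b for a, b in zip(xs[-1], attended)]
    return relu_mlp(layers, hidden), weights

def mlp_parameter_count(layers):
    # Number of weights and biases in the output MLP.
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)
```

In this sketch, the statement in the abstract says roughly that for the two tasks, `d` and the MLP size cannot both be $n^{o(1)}$, no matter how the real-valued entries of `Q`, `K`, `V` and the MLP weights are chosen.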
