Inductive Biases and Variable Creation in Self-Attention Mechanisms
Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang
TL;DR
This work gives a rigorous statistical analysis of self-attention and Transformer architectures, identifying a sparse variable creation inductive bias: with bounded weight norms, a single self-attention head can represent sparse dependencies on its input and generalize with sample complexity that scales only logarithmically with the context length $T$. The authors derive a covering-number-based capacity bound for attention modules and complement it with representation results showing that sparse functions can be encoded efficiently. They extend these results to multi-layer Transformers and discuss the practical roles of positional embeddings, residual connections, and multi-head attention. Synthetic experiments on learning sparse Boolean functions confirm the predicted $\log T$ scaling and highlight further phenomena, such as parity learning under i.i.d. sampling. Overall, the work connects the practical success of attention mechanisms to their theoretical capacity to represent sparse, long-range dependencies.
Abstract
Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
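As an illustration of how one head can "create a sparse variable", the sketch below hard-codes a head whose attention scores depend only on position: positions in a small relevant set get score `beta`, and all others get score 0. This is a toy construction under our own assumptions, not the paper's construction or experimental setup; the names `sparse_head`, `mass_off_support`, `relevant`, and `beta` are hypothetical. It shows that the attention mass leaking onto the $T - s$ irrelevant positions is $(T - s) / ((T - s) + s e^{\beta})$, so a score scale growing like $\log T$ is enough to keep the output close to the average over the relevant positions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_head(x, relevant, beta):
    """Toy attention head with purely positional scores (an assumption).

    Positions in `relevant` receive score `beta`; all others receive 0.
    The output is the attention-weighted average of the scalar inputs x.
    """
    scores = [beta if i in relevant else 0.0 for i in range(len(x))]
    weights = softmax(scores)
    return sum(w * xi for w, xi in zip(weights, x))


def mass_off_support(T, s, beta):
    """Total attention weight on the T - s irrelevant positions."""
    return (T - s) / ((T - s) + s * math.exp(beta))


if __name__ == "__main__":
    T, relevant = 1000, {3, 17}
    x = [1.0] * T          # irrelevant positions all carry 1
    x[17] = 0.0            # relevant bits are x[3] = 1 and x[17] = 0
    beta = math.log(T) + 5.0
    print("head output:", sparse_head(x, relevant, beta))
    print("leaked mass:", mass_off_support(T, len(relevant), beta))
```

With `beta` set to `log(T) + 5`, the head output stays close to 0.5, the mean of the two relevant bits, even though the 998 irrelevant positions all carry the value 1. With `beta = 0` the head reduces to uniform averaging over the whole sequence and the sparse dependence is lost.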
