On the Benefits of Rank in Attention Layers
Noah Amsel, Gilad Yehudai, Joan Bruna
TL;DR
This work analyzes how the expressive power of attention depends on the query/key rank $r$ and the number of heads $H$. It proves a rank separation for a nearest-neighbor target: a single full-rank head can approximate the target arbitrarily well, whereas low-rank heads reach comparable accuracy only when $H$ grows exponentially in the embedding dimension $d$ (or at a scale governed by the ratio $d/r$). The authors further show that depth can mitigate this weakness for short contexts, where polynomially many rank-1 heads suffice to approximate the target, though the same approach may not extend to long contexts. Experiments with off-the-shelf transformers corroborate the theory: the standard $H = d/r$ scaling may fail to capture the expressive power of full-rank attention, and low-rank attention can be significantly weaker under practical budgets. Overall, the paper calls for rethinking hyperparameter choices in transformers and highlights depth as a potential remedy for some low-rank limitations, especially for shorter contexts.
Abstract
Attention-based mechanisms are widely used in machine learning, most prominently in transformers. However, hyperparameters such as the rank of the attention matrices and the number of heads are scaled nearly the same way in all realizations of this architecture, without theoretical justification. In this work we show that there are dramatic trade-offs between the rank and number of heads of the attention mechanism. Specifically, we present a simple and natural target function that can be represented using a single full-rank attention head for any context length, but that cannot be approximated by low-rank attention unless the number of heads is exponential in the embedding dimension, even for short context lengths. Moreover, we prove that, for short context lengths, adding depth allows the target to be approximated by low-rank attention. For long contexts, we conjecture that full-rank attention is necessary. Finally, we present experiments with off-the-shelf transformers that validate our theoretical findings.
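To make the separation concrete, here is a minimal NumPy sketch of a single softmax attention head applied to a nearest-neighbor task. It is an illustration under stated assumptions, not the construction from the paper: the target is assumed to map a unit-norm query $x$ and unit-norm context points $y_1, \dots, y_N$ to the context point with the largest inner product with $x$; the value map is the identity; the full-rank head uses the query/key matrix $\beta I$; and the rank-$r$ head uses $\beta$ times a projection onto the first $r$ coordinates, which is one arbitrary low-rank choice rather than an optimal one. The names `attention_head`, `mean_sq_error`, and the values of `d`, `N`, `r`, and `beta` are hypothetical.

```python
import numpy as np

def unit_rows(a):
    """Normalize each row (or a single vector) to unit Euclidean norm."""
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_head(x, Y, M):
    """One attention head: scores y_i^T M x, identity value map (an assumption)."""
    return softmax(Y @ (M @ x)) @ Y

def nearest_neighbor(x, Y):
    """Assumed target: the context point with the largest inner product with x."""
    return Y[np.argmax(Y @ x)]

def mean_sq_error(M, d, N, trials, seed=0):
    """Average squared error of the head against the target on random unit vectors."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        x = unit_rows(rng.standard_normal(d))
        Y = unit_rows(rng.standard_normal((N, d)))
        diff = attention_head(x, Y, M) - nearest_neighbor(x, Y)
        total += float(diff @ diff)
    return total / trials

# Hypothetical sizes chosen for illustration only.
d, N, r, beta = 16, 2, 1, 1e4

M_full = beta * np.eye(d)        # full-rank query/key matrix
M_low = np.zeros((d, d))
M_low[:r, :r] = beta * np.eye(r) # rank-r matrix: sees only the first r coordinates

full_err = mean_sq_error(M_full, d, N, trials=2000)
low_err = mean_sq_error(M_low, d, N, trials=2000)
print(f"full-rank head error: {full_err:.4f}")
print(f"rank-{r} head error:   {low_err:.4f}")
```

With a large $\beta$ the softmax of the full-rank head concentrates on the nearest context point, so its error should be close to zero, while the rank-$r$ head compares points using only $r$ coordinates and often selects the wrong one. This single-head comparison only illustrates the intuition; the paper's lower bound concerns what any combination of many low-rank heads can achieve.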
