The Effect of Attention Head Count on Transformer Approximation
Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li
TL;DR
This work investigates how the number of attention heads $h$ in transformers governs approximation efficiency for sequence-to-vector mappings. It introduces generalized $D$-retrieval tasks, which are dense in the space of continuous mappings, and derives upper and lower bounds on the parameter count needed for $\epsilon$-approximation. The bounds reveal a bottleneck when $h<D$ and efficient approximation when $h\ge D$. The authors also characterize a memorization-dominated regime for a single head with a large embedding dimension, and validate the theory with synthetic experiments and real-data tasks (e.g., MS MARCO and CIFAR-10), where a phase transition emerges around the intrinsic dimension $D$. Overall, the results provide principled guidance for selecting head counts in transformers and underscore the practical impact of architectural choices on expressive power and efficiency.
Abstract
The transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $\Omega(1/\epsilon^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, in which case the approximation is achieved entirely by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.
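To give a feel for how quickly the lower-bound rate $1/\epsilon^{cT}$ quoted in the abstract grows with the sequence length $T$, the following minimal Python sketch evaluates the rate for a few values of $T$. The function name `lower_bound_params` and the values $c = 1$ and $\epsilon = 0.1$ are illustrative assumptions, not quantities taken from the paper, and constant factors are ignored.

```python
import math


def lower_bound_params(eps: float, c: float, seq_len: int) -> float:
    """Evaluate the rate (1/eps)**(c*T), ignoring constant factors.

    This is only the growth rate named in the abstract; the constant c
    is unspecified there, so any value passed here is an assumption.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return (1.0 / eps) ** (c * seq_len)


# Hypothetical values: c = 1 and eps = 0.1 are illustrative, not from the paper.
for seq_len in (2, 4, 8, 16):
    bound = lower_bound_params(eps=0.1, c=1.0, seq_len=seq_len)
    print(f"T = {seq_len:2d}: rate (1/eps)^(cT) is about {bound:.0e}")
```

Under these assumed values, each doubling of $T$ squares the rate, which is the sense in which too few heads makes approximation costly in the sequence length.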
