The Effect of Attention Head Count on Transformer Approximation

Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li

TL;DR

This work investigates how the number of attention heads $h$ in transformers governs approximation efficiency for sequence-to-vector mappings. It introduces generalized $D$-retrieval tasks that are dense in the space of continuous mappings and derives upper and lower bounds on the parameter count needed for $\epsilon$-approximation, revealing a bottleneck when $h<D$ and efficiency when $h\ge D$. The authors also characterize a memorization-dominated regime for a single head with large embedding, and validate the theory with synthetic experiments and real-data tasks (e.g., MS MARCO and CIFAR-10), where a phase transition around the intrinsic dimension $D$ emerges. Overall, the results provide principled guidance for selecting head counts in transformers and underscore the practical impact of architectural choices on expressive power and efficiency.

Abstract

The transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale as $\Omega(1/\epsilon^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, in which case approximation is achieved entirely by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.
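
To get a feel for how quickly the few-heads lower bound grows with sequence length, the short sketch below evaluates the growth term $(1/\epsilon)^{cT}$ on a log scale. It is only an illustration: the function name is hypothetical, constants hidden by the asymptotic notation are ignored, and $c = 1$ is an assumed placeholder, since the abstract only states that some constant $c$ exists.

```python
import math


def log10_param_lower_bound(eps: float, T: int, c: float = 1.0) -> float:
    """Return log10 of (1/eps)**(c*T), the growth term in the few-heads lower bound.

    Constants hidden by the asymptotic notation are ignored, and c = 1.0 is an
    assumed placeholder: the source only says "for some constant c".
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if T <= 0 or c <= 0:
        raise ValueError("T and c must be positive")
    # Work in log space so that long sequences do not overflow a float.
    return c * T * math.log10(1.0 / eps)


if __name__ == "__main__":
    # Sequence lengths taken from the synthetic experiment in Figure 1.
    for T in (8, 16, 32, 64, 128):
        exponent = log10_param_lower_bound(0.1, T)
        print(f"T={T:>3}: parameter count grows at least like 10^{exponent:.1f}")
```

Under these assumptions, halving $\epsilon$ multiplies the bound by $2^{cT}$, which is why the bottleneck becomes severe for long sequences.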

Paper Structure

This paper contains 64 sections, 11 theorems, 137 equations, 6 figures, 4 tables.

Key Result

Theorem 1

For fixed $d,T$, the family $\{\mathcal{F}_D\}_{D=1}^{\infty}$ is dense in $C(\mathcal{X}_T)$. That is, for every $F \in C(\mathcal{X}_T)$ and every $\epsilon>0$, there exist $D$ and $f \in \mathcal{F}_D$ such that
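
The displayed inequality that completes this statement is missing from this copy. Assuming density is meant in the uniform (sup) norm on $C(\mathcal{X}_T)$, which is the usual convention but is not confirmed here, the conclusion would read as follows, with $\|\cdot\|$ standing for whichever norm the paper uses on the output space:

$$
\sup_{X \in \mathcal{X}_T} \bigl\| F(X) - f(X) \bigr\| < \epsilon .
$$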

Figures (6)

  • Figure 1: Results on the synthetic example. (a) NMSE vs. number of heads $h$ for sequence lengths $T \in \{8,16,32,64,128\}$, with the hidden dimension fixed at $N=32$. Note the transition at $h=4$. (A table of the means and variances corresponding to these curves is provided in Table \ref{table:synthetic-variance}.) (b) Log hidden dimension $N$ vs. log accuracy for different sequence lengths $T$. The parameter count $k$ of the MLPs changes linearly with $N$. (Plots for $h=1$ and $h=2$ are in Figure \ref{fig:synthetic-nmse-appendix}.)
  • Figure 2: Experiments on real datasets. Training performance with different numbers of heads $h$ across different sequence lengths $T$. (a) Accuracy vs. number of heads for different $T$ in text retrieval; phase transition near $h=12$. For means and standard deviations, see Table \ref{table:MS_Mean_var}. MRR shows a similar trend; see Fig. \ref{fig:ms mrr} in the appendix. (b) Phase transition for text retrieval. (c) Accuracy vs. number of heads for different $T$ in image classification; phase transition near $h=10$. For means and standard deviations, see Table \ref{table: image}. (d) Weighted Reversal Score for image classification, with $\mathrm{err} = 1-\mathrm{Accuracy}$. The score becomes positive when $h\geq 10$, indicating a phase transition. (e) Weighted Reversal Score for the synthetic experiment; it becomes positive at $h=4$, exactly the intrinsic dimension of the task.
  • Figure 3: A zoomed-in view of Figure \ref{fig:synthetic-nmse}, showing that when the number of heads is sufficient, the loss first decreases and then increases, as explained in Remark \ref{observation:reverse}.
  • Figure 4: Additional plots for Figure \ref{fig:synthetic-nmse-b} with $h=1$ and $h=2$.
  • Figure 5: Plot of training MRR for the MS MARCO dataset.
  • ...and 1 more figure
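
The captions above report NMSE and MRR without restating their definitions. The sketch below shows one common convention for each, in plain Python. The function names are hypothetical, and the normalization used for NMSE is an assumption: this extract does not say whether the paper divides by the mean squared target (as done here) or by the target variance.

```python
from typing import Sequence


def nmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Normalized MSE: sum of squared errors divided by sum of squared targets.

    Assumed convention; other definitions normalize by the target variance.
    """
    if len(pred) != len(target) or len(target) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
    target_energy = sum(t ** 2 for t in target)
    if target_energy == 0:
        raise ValueError("targets are all zero; NMSE is undefined")
    return squared_error / target_energy


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """MRR, where ranks[i] is the 1-based rank of the first relevant item for query i."""
    if len(ranks) == 0 or any(r < 1 for r in ranks):
        raise ValueError("ranks must be a non-empty sequence of integers >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)
```

With this convention, a predictor that always outputs zero has an NMSE of 1, so values well below 1 indicate that the model captures most of the target signal.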

Theorems & Definitions (31)

  • Definition 1: $\epsilon$-approximation
  • Remark
  • Theorem 1: Density of the target class
  • Theorem 2: Approximation rates of transformers
  • Remark
  • Proof: Proof of Theorem \ref{thm:thm1}
  • Lemma 1: Relaxed target class and closure equivalence
  • Proof
  • Remark
  • Lemma 2: Order-statistic in the relaxed class
  • ...and 21 more