Leaner Transformers: More Heads, Less Depth

Hemanth Saratchandran, Damien Teney, Simon Lucey

TL;DR

The paper tackles the problem of overparameterization in transformers by proposing that multi-head attention inherently improves the conditioning of attention blocks, allowing deeper models to be replaced with leaner architectures. It introduces a theoretical framework showing that with enough heads, the attention matrix becomes well-conditioned (low condition number), facilitating optimization. Empirically, the authors demonstrate across vision and language tasks that increasing heads while reducing depth yields substantial parameter and memory savings (up to ~30-50%) with comparable or improved accuracy on ImageNet-1k, GLUE, TinyStories, and LRA. The findings suggest a practical design principle for efficient transformers and raise questions about the fundamental limits and scalability of lean architectures in large-scale settings.

Abstract

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits across a variety of transformer-based architectures of various scales, on tasks in computer vision (ImageNet-1k) as well as language and sequence modeling (GLUE benchmark, TinyStories, and the Long-Range Arena benchmark).
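To make the depth-versus-parameters arithmetic concrete, the following is a rough sketch that counts the weights in a stack of standard encoder blocks. It assumes the textbook layout (Q, K, V and output projections with biases, a two-layer MLP, two LayerNorms) and the convention that each head has width $D/h$; the function names `block_params` and `encoder_params` are illustrative, and the ViT-B-like sizes (embedding size 768, MLP ratio 4) are assumptions rather than the exact configurations used in the paper.

```python
def block_params(embed_dim: int, num_heads: int, mlp_ratio: int = 4) -> int:
    """Count weights and biases in one standard transformer encoder block.

    Assumption: each head has width embed_dim // num_heads, so the
    concatenated heads span exactly embed_dim columns.
    """
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    inner = num_heads * head_dim  # equals embed_dim under this convention
    # Q, K, V projections plus the output projection, each with a bias.
    attn = 3 * (embed_dim * inner + inner) + (inner * embed_dim + embed_dim)
    # Two-layer MLP with hidden width mlp_ratio * embed_dim.
    hidden = mlp_ratio * embed_dim
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * (2 * embed_dim)
    return attn + mlp + norms


def encoder_params(depth: int, embed_dim: int, num_heads: int) -> int:
    """Parameters in a stack of identical blocks (embeddings and head excluded)."""
    return depth * block_params(embed_dim, num_heads)


# Hypothetical comparison: a 12-layer, 12-head stack vs an 8-layer, 24-head stack.
base = encoder_params(depth=12, embed_dim=768, num_heads=12)
lean = encoder_params(depth=8, embed_dim=768, num_heads=24)
saving = 1 - lean / base
print(f"base: {base:,}  lean: {lean:,}  saving: {saving:.1%}")
```

Under these assumptions the head count does not change the size of a block, so cutting the depth from 12 to 8 layers removes one third of the block parameters. Configurations that keep the per-head width fixed while adding heads would change the attention term and are not covered by this sketch.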

Paper Structure

This paper contains 37 sections, 2 theorems, 9 equations, 9 figures, 6 tables.

Key Result

Theorem 3.2

Let $\mathbf{A}_i \in \mathbb{R}^{N\times \frac{D}{h}}$ be i.i.d. Gaussian random variables ($1 \leq i \leq h$). We define the multi-head matrix block $\mathbf{A} = [\mathbf{A}_1, \cdots, \mathbf{A}_h]$ of dimension $N \times D$ and assume $D \gg N$. Then, the condition number [...] Moreover, if we fix the dimension of the attention heads $d > 0$ such that $\mathbf{A}_i \in \mathbb{R}^{N\times d}$, we ha[...]
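The setting of Theorem 3.2 with a fixed head dimension $d$ can be probed numerically. The sketch below draws $h$ independent Gaussian blocks of shape $N \times d$, concatenates them into an $N \times hd$ matrix, and reports the ratio of its largest to smallest singular value. The function names and the sizes ($N = 32$, $d = 64$) are illustrative assumptions. The reference value $(\sqrt{D} + \sqrt{N}) / (\sqrt{D} - \sqrt{N})$ is a standard random-matrix heuristic for wide Gaussian matrices, used here only for comparison; it is not the bound stated in the paper.

```python
import numpy as np


def multihead_condition_number(n_tokens: int, head_dim: int, num_heads: int,
                               seed: int = 0) -> float:
    """Condition number of A = [A_1, ..., A_h] with i.i.d. standard Gaussian blocks."""
    rng = np.random.default_rng(seed)
    blocks = [rng.standard_normal((n_tokens, head_dim)) for _ in range(num_heads)]
    a = np.concatenate(blocks, axis=1)  # shape: N x (h * d)
    # Singular values are returned in descending order.
    s = np.linalg.svd(a, compute_uv=False)
    return float(s[0] / s[-1])


def gaussian_heuristic(n_tokens: int, total_dim: int) -> float:
    """Heuristic estimate for an N x D Gaussian matrix with D > N (assumption)."""
    root_d, root_n = np.sqrt(total_dim), np.sqrt(n_tokens)
    return float((root_d + root_n) / (root_d - root_n))


if __name__ == "__main__":
    n_tokens, head_dim = 32, 64  # illustrative sizes, not from the paper
    for num_heads in (1, 4, 16, 64):
        kappa = multihead_condition_number(n_tokens, head_dim, num_heads)
        ref = gaussian_heuristic(n_tokens, num_heads * head_dim)
        print(f"heads={num_heads:3d}  kappa={kappa:.3f}  heuristic={ref:.3f}")
```

In such a simulation the condition number should drift toward 1 as heads are added, which is the qualitative behavior the theorem describes. Trained attention weights are not Gaussian, so this illustrates the idealized statement only.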

Figures (9)

  • Figure 1: We redesign popular transformer models with an increased number of heads, using the theoretical insight that multi-head attention contributes to improving the conditioning of attention blocks. The benefits are so significant that we can reduce model depth while maintaining or improving accuracy, using about 50% fewer parameters.
  • Figure 2: Empirical measurement of the condition number of the attention layers in ViT-Bs with different numbers of heads. The conditioning improves (lower number) with additional heads, following the predictions of Theorem 3.2.
  • Figure 3: Accuracy on ImageNet-1k of variants of ViT-B with the original depth (12 layers, left) or reduced to 8 layers (right). Each point is annotated with the model's total number of parameters (in millions). As predicted, the number of heads correlates with performance. Remarkably, our models with reduced depth (right) and $\geq$12 heads (green dots) all obtain a higher test accuracy with fewer parameters than the original model (dotted line).
  • Figure 4: Similar experiments to Figure 3, where each model is now a variant of ViT-B with a different MLP width (X axes, reported as a factor of the token-embedding size). As predicted, increasing the width of MLPs has a weaker effect than adding attention heads. The slight benefit observed with 12 layers (left) cannot compensate for a reduction of depth to 8 layers (right), unlike what was observed with additional heads in Figure 3.
  • Figure 5: Additional variants of ViT-B with different numbers of layers and heads, and MLP width. Each model is annotated with its reduction in parameters. For 6–8 layers, doubling the MLP width yields little benefit, indicating that the number of heads is more important.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Definition 3.1
  • Theorem 3.2
  • Lemma 3.3
  • Proof of Theorem 3.2
  • Proof of Lemma 3.3