Allocation of Parameters in Transformers

Ruoxi Yu, Haotian Jiang, Jingpu Cheng, Penghao Yu, Qianxiao Li, Zhong Li

TL;DR

This paper investigates how to allocate Transformer parameters, especially the number of heads and head dimension, under a fixed budget $D=\sum_{m=1}^M H_m d_m$, to balance expressivity and efficiency. It derives an approximation-error bound for information extraction in early layers and proves a softmax saturation phenomenon that enables reduced head dimensions in later layers. The authors propose principled allocation strategies and practical compression methods, including grouped heads and SVD-based low-rank head reductions, validated by synthetic tests and pretrained-model–style experiments. The results provide concrete guidelines for designing and compressing Transformer architectures with limited computational resources while preserving performance on long-range sequence tasks.
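As a rough illustration of what an SVD-based low-rank head reduction can look like, the sketch below shrinks one attention head from dimension `d_head` to `rank` by truncating the SVD of the query-key product. This is a generic truncated-SVD factorization, not necessarily the procedure the paper uses; the function name `low_rank_head`, the matrix shapes, and the random weights are all assumptions made for the example.

```python
import numpy as np


def low_rank_head(w_q, w_k, rank):
    """Return (w_q_r, w_k_r), each of shape (d_model, rank), such that
    w_q_r @ w_k_r.T is the best rank-`rank` approximation (in Frobenius
    norm) of the original attention logit matrix w_q @ w_k.T."""
    product = w_q @ w_k.T  # (d_model, d_model), rank at most d_head
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    root = np.sqrt(s[:rank])  # split each singular value between the factors
    w_q_r = u[:, :rank] * root
    w_k_r = vt[:rank, :].T * root
    return w_q_r, w_k_r


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, d_head, rank = 32, 8, 4
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

w_q_r, w_k_r = low_rank_head(w_q, w_k, rank)
full = w_q @ w_k.T
approx = w_q_r @ w_k_r.T

# Eckart-Young: the Frobenius error equals the norm of the discarded
# singular values of the original product.
singular_values = np.linalg.svd(full, compute_uv=False)
error = np.linalg.norm(full - approx)
expected = np.sqrt(np.sum(singular_values[rank:] ** 2))
print(w_q_r.shape, w_k_r.shape, np.isclose(error, expected))
```

The reduced head stores `2 * d_model * rank` query and key parameters instead of `2 * d_model * d_head`, and the approximation error is governed by how quickly the singular values of the query-key product decay.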

Abstract

Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters (mainly attention heads and head dimensions) should be allocated across layers to balance expressivity and efficiency. We first provide a mathematical analysis of the role of early layers in information extraction from an approximation perspective, with a theoretical characterization of the trade-off between the number of heads and head dimension under a fixed parameter budget. In addition, we uncover and prove the saturation behavior of softmax activations: continuously increasing head dimensions can lead to diminishing returns in learning errors, particularly for long sequences. Supported by both theory and experiments, this saturation pattern suggests that later layers can operate more efficiently with reduced parameters. Combining these insights, we propose principled strategies for allocating attention heads and dimensions across Transformers' layers, shedding light on theoretically-grounded model efficiency of Transformer-based architectures.

Paper Structure

This paper contains 42 sections, 21 theorems, 81 equations, 10 figures.

Key Result

Theorem 4.1

To approximate the linear target $\mathrm{H}_t(\boldsymbol{X})=\sum_{i=0}^{\infty}\boldsymbol{\rho}_i\boldsymbol{x}_{t-i}$, we employ $M$ groups of heads, where group $m$ contains $H_m$ heads, each of dimension $d_m$. Given the fixed model dimension $D=\sum_{m=1}^M H_m\cdot d_m$, with probability … where … and the first term in equation (E(D)) equals zero when $d_m> d$.
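The bookkeeping behind this setup can be sketched in a few lines of Python. The helpers below compute the model dimension $D=\sum_{m=1}^M H_m\cdot d_m$ for a list of head groups, enumerate the possible $(H, d)$ splits of a budget in the single-group case $M=1$, and evaluate a truncated, scalar version of the linear target. All function names are hypothetical, and the scalar inputs, the finite truncation, and the zero padding for negative time indices are simplifying assumptions, not part of the theorem.

```python
def model_dimension(groups):
    """D = sum over groups of H_m * d_m, with groups given as (H_m, d_m) pairs."""
    return sum(heads * dim for heads, dim in groups)


def single_group_allocations(budget):
    """All (H, d) pairs with H * d == budget, i.e. the M = 1 case."""
    return [(h, budget // h) for h in range(1, budget + 1) if budget % h == 0]


def linear_target(rho, xs, t):
    """Truncated H_t(X) = sum_i rho_i * x_{t-i} for scalar inputs.

    Assumes rho holds finitely many coefficients and x_s = 0 for s < 0.
    """
    return sum(r * xs[t - i] for i, r in enumerate(rho) if t - i >= 0)


# Two groups: 4 heads of dimension 8 and 2 heads of dimension 16.
print(model_dimension([(4, 8), (2, 16)]))  # 64

# Every way to split a budget of 12 into heads times head dimension.
print(single_group_allocations(12))

# 0.5 * x_2 + 0.25 * x_1 = 0.5 * 4.0 + 0.25 * 2.0
print(linear_target([0.5, 0.25], [1.0, 2.0, 4.0], t=2))  # 2.5
```

Enumerating the splits makes the trade-off concrete: at a fixed budget, more heads necessarily means a smaller dimension per head, and the reverse.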

Figures (10)

  • Figure 1: Trade-off Trend
  • Figure 2: 4-gram Example
  • Figure 3: Saturation Scaling Law of Softmax
  • Figure 4: LiteLlama Single Head: Training Loss vs $d_h$
  • Figure 5: $6$-layer Transformer: Loss vs Head Dimension
  • ...and 5 more figures

Theorems & Definitions (31)

  • Theorem 4.1
  • Corollary 4.1: Parameter allocation via trade-offs
  • Lemma 4.1: Volterra Series Decomposition
  • Theorem 4.2: Trade-offs via parameter allocation; informal
  • Theorem 5.1: The saturation pattern of softmax
  • Corollary 5.1
  • Lemma B.1
  • proof
  • Lemma B.2: High-probability projection
  • proof
  • ...and 21 more