Allocation of Parameters in Transformers
Ruoxi Yu, Haotian Jiang, Jingpu Cheng, Penghao Yu, Qianxiao Li, Zhong Li
TL;DR
This paper investigates how to allocate Transformer parameters, especially the number of heads and the head dimension, under a fixed budget $D=\sum_{m=1}^M H_m d_m$ (with $H_m$ heads of dimension $d_m$), to balance expressivity and efficiency. It derives an approximation-error bound for information extraction in early layers and proves a softmax saturation phenomenon that allows reduced head dimensions in later layers. The authors propose principled allocation strategies and practical compression methods, including grouped heads and SVD-based low-rank head reductions, validated by synthetic tests and by experiments in a pretrained-model setting. The results provide concrete guidelines for designing and compressing Transformer architectures under limited computational resources while preserving performance on long-range sequence tasks.
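As a rough illustration of what an SVD-based low-rank head reduction can look like, the NumPy sketch below truncates the query-key product of a single head to rank $r$. It is a generic truncated-SVD construction under stated assumptions, not the authors' procedure: the function name `low_rank_head`, the matrix shapes, and the choice to compress the product $W_Q W_K^\top$ are all hypothetical.

```python
import numpy as np

def low_rank_head(W_q, W_k, r):
    """Return rank-r factors (W_q_r, W_k_r), each of shape (d_model, r),
    such that W_q_r @ W_k_r.T is the best rank-r approximation of
    W_q @ W_k.T in Frobenius norm (Eckart-Young)."""
    A = W_q @ W_k.T                       # (d_model, d_model), rank <= d_head
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])               # split singular values evenly
    W_q_r = U[:, :r] * sqrt_s             # scale the kept left vectors
    W_k_r = Vt[:r, :].T * sqrt_s          # scale the kept right vectors
    return W_q_r, W_k_r

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_model, d_head, r = 32, 8, 4
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))

W_q_r, W_k_r = low_rank_head(W_q, W_k, r)
A = W_q @ W_k.T
err = np.linalg.norm(A - W_q_r @ W_k_r.T)

# The truncation error equals the norm of the discarded singular values.
s = np.linalg.svd(A, compute_uv=False)
tail = np.sqrt(np.sum(s[r:] ** 2))
print(f"rank-{r} error: {err:.4f}, discarded singular mass: {tail:.4f}")
```

In this sketch the head dimension drops from `d_head` to `r`, and the approximation error is exactly the norm of the discarded singular values, so a head whose spectrum decays quickly loses little.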
Abstract
Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters, mainly attention heads and head dimensions, should be allocated across layers to balance expressivity and efficiency. We first provide a mathematical analysis of the role of early layers in information extraction from an approximation perspective, with a theoretical characterization of the trade-off between the number of heads and the head dimension under a fixed parameter budget. In addition, we uncover and prove the saturation behavior of softmax activations: continuously increasing head dimensions can lead to diminishing returns in learning errors, particularly for long sequences. Supported by both theory and experiments, this saturation pattern suggests that later layers can operate more efficiently with reduced parameters. Combining these insights, we propose principled strategies for allocating attention heads and dimensions across Transformer layers, shedding light on the theoretically grounded model efficiency of Transformer-based architectures.
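To make the fixed-budget trade-off concrete, the short Python sketch below enumerates the ways a single budget can be split into a number of heads and a head dimension, then compares two allocations with the same total $\sum_m H_m d_m$. Reading $m$ as a layer index is an assumption here, and the numbers are hypothetical; neither allocation is taken from the paper.

```python
def head_dim_splits(budget):
    """All (num_heads, head_dim) pairs with num_heads * head_dim == budget."""
    return [(h, budget // h) for h in range(1, budget + 1) if budget % h == 0]

def total_budget(allocation):
    """Sum of H_m * d_m over a list of (H_m, d_m) pairs."""
    return sum(h * d for h, d in allocation)

# Every way to spend a budget of 64 on one attention block.
splits = head_dim_splits(64)
print(splits)

# Two hypothetical four-block allocations with the same total budget:
# a uniform one, and one that moves budget toward the first block
# while using a smaller head dimension in the later blocks.
uniform = [(8, 64)] * 4
front_loaded = [(16, 64), (8, 64), (8, 32), (8, 32)]
print(total_budget(uniform), total_budget(front_loaded))
```

Both allocations spend the same budget, so any difference in their performance would come from how the budget is distributed, which is the question the analysis addresses.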
