Improving Transformers with Dynamically Composable Multi-Head Attention

Da Xiao; Qingye Meng; Shengping Li; Xingyuan Yuan

TL;DR

This work addresses two weaknesses of standard multi-head attention (MHA), the low-rank bottleneck of attention score matrices and head redundancy, by introducing Dynamically Composable Multi-Head Attention (DCMHA). DCMHA composes attention heads dynamically through a Compose function, an input-dependent transformation applied to the attention score and weight matrices and derived from a tensor decomposition. DCMHA is a drop-in replacement for MHA; the resulting model, DCFormer, improves language modeling and downstream task results and matches the performance of Transformers trained with roughly 1.7x-2.0x the compute. The approach provides both theoretical connections to projection-based head composition and practical mechanisms to maintain efficiency via low-rank plus diagonal decompositions and grouped tensor parallel training. Empirically, DCFormer scales well, outperforms baseline Transformers across model sizes, and transfers to vision transformers with improved image classification performance, while also offering interpretable insights into head diversity and the dynamic interaction of QK and OV circuits.

Abstract

Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads work independently, causing problems such as the low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\mathit{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms the open-source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.
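
To make the idea of input-dependent head composition more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the released implementation (see the repository above for that). It mixes attention matrices across the head axis with a per-position map built from a static base matrix, a low-rank dynamic term, and a diagonal gate, following the "low-rank plus diagonal" description in the TL;DR. The names `compose`, `W_base`, `W_u`, `W_v`, `W_gate`, the rank `R`, the tanh gate, and the choice to derive the dynamic weights from the query-side input only are all assumptions made for this example; masking is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compose(A, X, W_base, W_u, W_v, W_gate):
    """Mix attention matrices across heads, separately for each query position.

    A: (H, T, S) attention scores or weights; X: (T, D) query-side inputs.
    For position t the implied H x H map is
        C_t = W_base + U_t @ V_t.T + diag(gate_t),
    but C_t is never materialised (hypothetical parameterisation).
    """
    H, T = A.shape[0], X.shape[0]
    R = W_u.shape[1] // H
    U = (X @ W_u).reshape(T, H, R)          # dynamic low-rank factor
    V = (X @ W_v).reshape(T, H, R)          # dynamic low-rank factor
    gate = np.tanh(X @ W_gate)              # (T, H) dynamic diagonal
    static = np.einsum('hg,gts->hts', W_base, A)
    low = np.einsum('tgr,gts->rts', V, A)   # project H heads down to R
    low_rank = np.einsum('thr,rts->hts', U, low)  # and back up to H
    diagonal = gate.T[:, :, None] * A
    return static + low_rank + diagonal

rng = np.random.default_rng(0)
H, T, D, d_k, R = 4, 6, 16, 4, 2
X = rng.normal(size=(T, D))
W_Q, W_K, W_V = (rng.normal(size=(H, D, d_k)) * 0.1 for _ in range(3))

def init_compose_params(rng):
    # Identity base map, small random dynamic projections (assumed init).
    return (np.eye(H),
            rng.normal(size=(D, H * R)) * 0.1,
            rng.normal(size=(D, H * R)) * 0.1,
            rng.normal(size=(D, H)) * 0.1)

pre, post = init_compose_params(rng), init_compose_params(rng)

Q = np.einsum('td,hdk->htk', X, W_Q)
K = np.einsum('td,hdk->htk', X, W_K)
Vh = np.einsum('td,hdk->htk', X, W_V)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (H, T, T)
scores = compose(scores, X, *pre)                  # compose attention scores
weights = softmax(scores, axis=-1)
weights = compose(weights, X, *post)               # compose attention weights
out = weights @ Vh                                 # (H, T, d_k)
```

In this sketch the low-rank path costs two small contractions over the head axis instead of a full H x H map per position, which is the kind of saving a low-rank plus diagonal decomposition is meant to provide. Note that rows of the composed weights no longer need to sum to one.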

Paper Structure (26 sections, 2 theorems, 13 equations, 8 figures, 11 tables)

Introduction
Head Composition by Transforming Attention Matrices
Dynamically Composable Multi-Head Attention
A Tensor Decomposition Perspective
Grouped Composition for Tensor Parallel Training
Complexity Analysis
Experiments
Scaling Laws
Large Scale Training and Downstream Evaluations
Synthetic Tasks and Weight Analysis of Trained Models
Training and Inference Overhead
Image Classification
Ablation Studies and Tradeoffs
Conclusion
Related work
...and 11 more sections

Key Result

Theorem 2.1

Composition of attention scores $\{A_i\}_{i=1}^H$ by a composition map $C \in \mathbb{R}^{H \times H}$ is equivalent to QK projection composition with $H$-fold expansion, as defined in the paper's equation on QK projection composition (source label: eq:compose_wqk).
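
The referenced equation is not reproduced on this page, so the following numerical check rests on one reading of the statement: mixing the per-head score matrices $A_j = Q_j K_j^\top$ with row $i$ of $C$ gives the same result as a single head whose query projection concatenates the $H$ scaled projections $C_{ij} W_Q^j$ and whose key projection concatenates all $W_K^j$, making the head dimension $H$ times larger. The variable names and shapes are illustrative, and the scaling factor and mask are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T, D, d_k = 4, 5, 16, 8

X = rng.normal(size=(T, D))            # token representations
W_Q = rng.normal(size=(H, D, d_k))     # per-head query projections
W_K = rng.normal(size=(H, D, d_k))     # per-head key projections
C = rng.normal(size=(H, H))            # static composition map

# Left-hand side: compute per-head scores, then mix them across heads.
Q = np.einsum('td,hdk->htk', X, W_Q)
K = np.einsum('td,hdk->htk', X, W_K)
A = Q @ K.transpose(0, 2, 1)                     # (H, T, T)
A_composed = np.einsum('ij,jts->its', C, A)      # A'_i = sum_j C_ij A_j

# Right-hand side: H-fold expanded projections (assumed construction).
# Keys: every head sees the concatenation of all key projections.
W_K_exp = np.concatenate(list(W_K), axis=1)      # (D, H * d_k)
# Queries: head i concatenates the projections scaled by row i of C.
W_Q_exp = np.stack([
    np.concatenate([C[i, j] * W_Q[j] for j in range(H)], axis=1)
    for i in range(H)
])                                               # (H, D, H * d_k)

Q_exp = np.einsum('td,hdk->htk', X, W_Q_exp)     # (H, T, H * d_k)
K_exp = X @ W_K_exp                              # (T, H * d_k)
A_expanded = Q_exp @ K_exp.T                     # (H, T, T)

assert np.allclose(A_composed, A_expanded)
```

The check also shows why composing scores is the cheaper route: the left-hand side needs only an $H \times H$ mix of existing score matrices, while the expanded form multiplies the query and key width by $H$.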

Figures (8)

Figure 1: Simplified and prototypical composition maps for 8 heads and their functions. Lighter color denotes larger value.
Figure 2: Illustration of DCMHA. (a) Scale and optional mask operations are omitted. Each linear projection's input and output are denoted by their dims and the projected (i.e. mixed) dims are colored. (b) Attention vector $A_{:ij}$ can be either attention scores or weights.
Figure 3: (top) Scaling curves of Transformers and DCFormers. (bottom) Scaling curves of relative improvement of RoPE + SwiGLU MLP and DCMHA. TFM++ = TFM + RoPE + SwiGLU MLP; DCFM = TFM + DCMHA; DCFM++ = TFM++ + DCMHA.
Figure 4: Scaling curves of Pythia and DCPythia.
Figure 5: Mean cumulative captured variance of concatenated QK and OV heads of Pythia-6.9B and DCPythia-6.9B.
...and 3 more figures

Theorems & Definitions (2)

Theorem 2.1
Theorem 2.2
