MoH: Multi-Head Attention as Mixture-of-Head Attention

Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

TL;DR

MoH recasts multi-head attention as a Mixture-of-Experts over attention heads: each token is routed to a Top-K subset of heads, alongside shared heads and a two-stage routing scheme that balances global and task-specific knowledge. It replaces the standard summation over heads with a weighted summation, keeping the parameter count comparable to multi-head attention while activating heads dynamically to improve efficiency and accuracy across ViT, DiT, and LLMs. In the reported experiments, MoH matches or exceeds multi-head attention while using only 50%-90% of the heads, and pre-trained MHA models (e.g., LLaMA3-8B) can be continue-tuned into MoH models; MoH-LLaMA3-8B improves average accuracy across 14 benchmarks by 2.4% while using only 75% of the heads. The authors position MoH as an alternative to multi-head attention for both vision and language models.
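
The two forms mentioned above can be written side by side. This is a sketch in standard Transformer notation, which this summary does not itself define: $H^i$ is the output of head $i$, $W_O^i$ is the block of rows of the output projection $W_O$ that multiplies head $i$, and $g_i$ is a per-token routing score introduced here for illustration, equal to zero for any head the token does not activate.

```latex
\mathrm{MHA}(X) = \mathrm{Concat}(H^1, \dots, H^h)\, W_O = \sum_{i=1}^{h} H^i W_O^i,
\qquad
\mathrm{MoH}(X) = \sum_{i=1}^{h} g_i \, H^i W_O^i .
```

Setting every $g_i = 1$ recovers standard multi-head attention, so both forms share the same head and output-projection parameters.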

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages. First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while using only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
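
To make the summation form and the weighted, per-token head selection concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: all function and variable names are hypothetical, the router is a single linear map with a separate softmax over shared and routed heads, and the two-stage balancing between the groups is omitted. For clarity the sketch computes every head and then masks, so it shows the arithmetic but not the inference savings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_outputs(x, w_q, w_k, w_v):
    """Per-head self-attention outputs, shape (h, T, d_head)."""
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    v = np.einsum("td,hde->hte", x, w_v)
    scores = np.einsum("hte,hse->hts", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("hts,hse->hte", softmax(scores), v)

def mha_concat(heads, w_o):
    """Standard form: concatenate the heads, then apply W_O."""
    h, t, dh = heads.shape
    return heads.transpose(1, 0, 2).reshape(t, h * dh) @ w_o

def mha_sum(heads, w_o):
    """Summation form: sum_i H^i W_O^i, with W_O split into row blocks."""
    h, t, dh = heads.shape
    return np.einsum("hte,hed->td", heads, w_o.reshape(h, dh, -1))

def moh_gates(x, w_router, n_shared, top_k):
    """Per-token gates (T, h). Assumed scheme: the first n_shared heads are
    always active; each token keeps its top_k highest-scoring routed heads."""
    logits = x @ w_router
    shared = softmax(logits[:, :n_shared])
    routed = softmax(logits[:, n_shared:])
    idx = np.argsort(routed, axis=-1)[:, -top_k:]
    mask = np.zeros_like(routed)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return np.concatenate([shared, routed * mask], axis=-1)

def moh_sum(heads, w_o, gates):
    """Weighted summation: sum_i g_i H^i W_O^i, with per-token gates g_i."""
    h, t, dh = heads.shape
    return np.einsum("th,hte,hed->td", gates, heads, w_o.reshape(h, dh, -1))

rng = np.random.default_rng(0)
T, d_model, h, d_head = 5, 16, 8, 4
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(h * d_head, d_model))
w_router = rng.normal(size=(d_model, h))

heads = head_outputs(x, w_q, w_k, w_v)
# The concatenation and summation forms agree numerically.
assert np.allclose(mha_concat(heads, w_o), mha_sum(heads, w_o))

# 2 shared + 4 routed heads out of 8, i.e. 75% of heads active per token.
gates = moh_gates(x, w_router, n_shared=2, top_k=4)
out = moh_sum(heads, w_o, gates)
```

With all gates set to one, `moh_sum` reduces to `mha_sum`, which mirrors the claim that the weighted summation generalizes standard multi-head attention without adding heads.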

Paper Structure

This paper contains 24 sections, 10 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: A high-level comparison between multi-head attention and our proposed mixture-of-head attention. Subfigure (a) illustrates a standard multi-head attention layer with $h$ attention heads, while subfigure (b) shows our proposed Mixture-of-Head attention (MoH) architecture. MoH does not increase the number of attention heads, so its total number of parameters is comparable to that of multi-head attention.
  • Figure 2: Performance evolution during continue-tuning. The MoH model quickly recovers to over 95% of the performance of the original model within a training budget of 10B tokens. Performance then improves gradually as the number of training tokens increases.
  • Figure 3: Visualization of the head load distribution in the final MoH layer. For ViT and DiT, we present the head load distributions for the categories "Desk", "Goldfish", and "Ice cream". For LLM, we display the head distributions for the tasks "LogiQA", "PIQA", and "WinoGrande". MoH-ViT-B, MoH-DiT-XL/2, and MoH-LLM-B activate 75%, 90%, and 75% of the attention heads, respectively. "Density" denotes the ratio of the number of head activations to the total number of tokens.
  • Figure A: Additional visualization of the head load distribution in the final MoH layer. MoH-ViT-B activates 75% of the attention heads. MoH-DiT-XL/2 activates 90% of the attention heads.
  • Figure B: Additional visualization of the head load distribution in MoH-LLaMA3-8B.
  • ...and 4 more figures
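
The "Density" quantity defined in the Figure 3 caption (head activations divided by the total number of tokens) can be computed from a token-by-head activation mask. The following is a small sketch; the function name `head_density` and the toy mask are hypothetical and are not taken from the paper.

```python
import numpy as np

def head_density(active):
    """active: boolean array of shape (num_tokens, num_heads), True where a
    token activated a head. Returns the per-head activation density."""
    active = np.asarray(active, dtype=bool)
    return active.sum(axis=0) / active.shape[0]

# Toy mask: 4 tokens, 3 heads. Head 0 is active for every token.
toy_mask = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]
density = head_density(toy_mask)  # array([1.0, 0.5, 0.25])
```

A head that every token activates has density 1.0, while less frequently selected heads have lower density.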