MoH: Multi-Head Attention as Mixture-of-Head Attention
Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan
TL;DR
MoH reframes multi-head attention by treating attention heads as experts in a Mixture-of-Experts (MoE) mechanism and routing each token to a Top-K subset of heads, with shared heads and a two-stage routing scheme that balances global and task-specific knowledge. It replaces the standard summation over heads with a weighted summation, keeping the parameter count unchanged while activating heads dynamically to improve efficiency and accuracy across ViT, DiT, and LLMs. Extensive experiments show that MoH matches or exceeds multi-head attention while using only 50%-90% of the heads, and that pre-trained MHA models (e.g., LLaMA3-8B) can be continue-tuned into MoH models with notable gains. The work presents MoH as a versatile, scalable approach to efficient attention for diverse vision and language tasks, with deployment advantages for large-scale models.
Abstract
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages. First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only 50%-90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
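
The two ideas above, the summation form of multi-head attention and the per-token weighted Top-K summation over heads, can be illustrated with a minimal NumPy sketch. It is a simplified illustration rather than the authors' implementation: the shared heads and the two-stage routing are omitted, and the router matrix `w_router`, the helper names, and all shapes are assumptions introduced here. A dense computation is used for clarity, so the sketch shows the arithmetic but not the efficiency gain of skipping inactive heads.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def head_outputs(x, w_q, w_k, w_v):
    """Per-head attention outputs, shape (heads, tokens, d_head)."""
    q = np.einsum("td,hde->hte", x, w_q)
    k = np.einsum("td,hde->hte", x, w_k)
    v = np.einsum("td,hde->hte", x, w_v)
    scores = np.einsum("hte,hse->hts", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("hts,hse->hte", softmax(scores), v)

def mha_concat(heads, w_o):
    """Standard form: concatenate heads, then apply one output projection."""
    h, t, e = heads.shape
    concat = heads.transpose(1, 0, 2).reshape(t, h * e)
    return concat @ w_o.reshape(h * e, -1)

def mha_sum(heads, w_o):
    """Summation form: each head uses its own row block of the output projection."""
    return np.einsum("hte,hed->td", heads, w_o)

def topk_gates(x, w_router, k):
    """Per-token gates: softmax router scores, zeroed outside the Top-K heads.
    (Assumed simplified router; no shared heads, no two-stage routing.)"""
    probs = softmax(x @ w_router)               # (tokens, heads)
    top = np.argsort(probs, axis=-1)[:, -k:]    # indices of the K largest scores
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top, 1.0, axis=-1)
    return probs * mask

def moh(heads, w_o, gates):
    """Weighted summation over heads; heads with a zero gate contribute nothing."""
    return np.einsum("th,hte,hed->td", gates, heads, w_o)

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
tokens, d_model, n_heads, d_head, k = 5, 16, 4, 4, 3
x = rng.normal(size=(tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(n_heads, d_head, d_model))
w_router = rng.normal(size=(d_model, n_heads))

heads = head_outputs(x, w_q, w_k, w_v)
gates = topk_gates(x, w_router, k)
out = moh(heads, w_o, gates)
```

In this sketch `mha_concat` and `mha_sum` return the same values, which is the summation form in action, and setting every gate to 1 in `moh` recovers standard multi-head attention. With `k = 3` of 4 heads, each token activates 75% of the heads.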
