MH-MoE: Multi-Head Mixture-of-Experts

Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei

TL;DR

This work targets efficient scaling of Mixture-of-Experts models: it introduces a multi-head mechanism while keeping FLOPs and parameter parity with standard sparse Mixture-of-Experts (SMoE). The proposed MH-MoE adds a head dimension and front and back projection layers so that the model can attend to information from multiple representation spaces across experts. It outperforms vanilla SMoE and fine-grained MoE on language modeling tasks. In both standard and 1-bit BitNet experiments, MH-MoE shows consistent improvements and remains compatible with quantized LLMs, and ablations confirm that the head and merge layers are key contributors to the gains. The approach offers a practical route to richer MoE representations without increasing computational cost.

Abstract

Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.
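
To make the multi-head mechanism concrete, here is a minimal sketch of how a single token could flow through such a layer: a head projection, a split into sub-tokens, per-sub-token routing to experts, and a merge projection. It is an illustration under stated assumptions, not the authors' implementation. The class name MHMoESketch, the square projection shapes, the two-layer ReLU experts, the softmax router, and the top-k selection without gate renormalization are all hypothetical choices made for this example, and the sketch makes no attempt to reproduce the FLOPs and parameter parity described in the abstract.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix, scaled by 1/sqrt(cols) (an arbitrary choice)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class MHMoESketch:
    """Hypothetical multi-head MoE layer operating on one token vector."""

    def __init__(self, d_model, n_heads, n_experts, d_hidden, top_k=1, seed=0):
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        rng = random.Random(seed)
        self.n_heads = n_heads
        self.top_k = top_k
        self.d_head = d_model // n_heads
        # Projections applied before and after the experts (shapes assumed).
        self.W_head = rand_matrix(d_model, d_model, rng)
        self.W_merge = rand_matrix(d_model, d_model, rng)
        # The router scores each sub-token against every expert.
        self.W_router = rand_matrix(n_experts, self.d_head, rng)
        # Each expert is a small two-layer feed-forward network (assumed).
        self.experts = [
            (rand_matrix(d_hidden, self.d_head, rng),
             rand_matrix(self.d_head, d_hidden, rng))
            for _ in range(n_experts)
        ]

    def forward(self, x):
        h = matvec(self.W_head, x)  # head projection
        merged = []
        for i in range(self.n_heads):
            # Split the projected token into n_heads sub-tokens.
            sub = h[i * self.d_head:(i + 1) * self.d_head]
            gates = softmax(matvec(self.W_router, sub))
            # Pick the top_k experts by gate value (no renormalization).
            chosen = sorted(range(len(gates)), key=lambda e: gates[e],
                            reverse=True)[:self.top_k]
            y = [0.0] * self.d_head
            for e in chosen:
                W_in, W_out = self.experts[e]
                hidden = [max(0.0, v) for v in matvec(W_in, sub)]  # ReLU
                expert_out = matvec(W_out, hidden)
                y = [a + gates[e] * b for a, b in zip(y, expert_out)]
            merged.extend(y)  # concatenate sub-token outputs in head order
        return matvec(self.W_merge, merged)  # merge projection


# Usage: an 8-dimensional token, 2 heads, 4 experts, top-2 routing.
layer = MHMoESketch(d_model=8, n_heads=2, n_experts=4, d_hidden=16, top_k=2)
token = [0.1 * i for i in range(8)]
output = layer.forward(token)
```

Because routing happens per sub-token rather than per token, different slices of the same token can be sent to different experts, which is the sense in which the layer draws on several representation spaces at once.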

Paper Structure

This paper contains 10 sections, 10 equations, 5 tables.