Layerwise Recurrent Router for Mixture-of-Experts

Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu

TL;DR

This work addresses parameter inefficiency in Mixture-of-Experts (MoE) language models by identifying that independent layer-wise routing limits exploration and expert utilization. It introduces the Layerwise Recurrent Router (RMoE), which employs a cross-layer GRU to condition each layer's routing on previous layers, coupled with a per-layer projection to prevent embedding collapse. Empirical results show that RMoE consistently surpasses baselines across language modeling tasks and scales, with analyses revealing increased cross-layer mutual information, more balanced gate entropy, and greater expert diversity driven by the recurrent gradient. The approach is orthogonal to and compatible with other MoE designs, offering a practical pathway to more parameter-efficient, scalable LLMs.

Abstract

The scaling of large language models (LLMs) has revolutionized their capabilities in various tasks, yet this growth must be matched with efficient computational strategies. The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Despite their advantages, current MoE models often display parameter inefficiency. For instance, a pre-trained MoE-based LLM with 52 billion parameters might perform comparably to a standard model with 6.7 billion parameters. Routers are a crucial part of MoE, yet current routers in different layers assign tokens independently without leveraging historical routing information, potentially leading to suboptimal token-expert combinations and the parameter inefficiency problem. To alleviate this issue, we introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE). RMoE leverages a Gated Recurrent Unit (GRU) to establish dependencies between routing decisions across consecutive layers. Such layerwise recurrence can be computed efficiently in parallel for input tokens and introduces negligible costs. Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at https://github.com/qiuzh20/RMoE .
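
To make the layerwise recurrence concrete, the following is a minimal sketch for a single token, not the authors' implementation. It follows the three steps named in the paper's Figure 1 caption: a per-layer projection to a lower dimension, a GRU shared across layers, and a per-layer router that reads the GRU output. Everything else is an assumption made for illustration: the helper names (`GRUCell`, `route`, `recurrent_routing`), the bias-free GRU variant, the zero initial state, the softmax top-k gate, the toy dimensions, and the random per-layer inputs that stand in for real hidden states.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Bias-free GRU cell (assumed variant); one instance is shared by all layers."""
    def __init__(self, dim, rng):
        # Input (W*) and recurrent (U*) weights for update gate z, reset gate r, candidate n.
        self.Wz, self.Uz = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)
        self.Wr, self.Ur = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)
        self.Wn, self.Un = rand_matrix(dim, dim, rng), rand_matrix(dim, dim, rng)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.Wn, x), matvec(self.Un, gated_h))]
        # Interpolate between the candidate state and the previous layer's state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def route(h, gate_W, top_k):
    """Softmax over expert logits computed from the GRU output, then top-k selection."""
    logits = matvec(gate_W, h)
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    experts = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
    return experts, probs

def recurrent_routing(layer_inputs, projections, gates, gru, router_dim, top_k=2):
    h = [0.0] * router_dim  # assumed zero initial recurrent state
    decisions = []
    for x, P, G in zip(layer_inputs, projections, gates):
        x_proj = matvec(P, x)                  # step I: per-layer projection
        h = gru.step(x_proj, h)                # step II: shared GRU across layers
        decisions.append(route(h, G, top_k))   # step III: route from the GRU output
    return decisions

rng = random.Random(0)
hidden_dim, router_dim, num_experts, num_layers = 8, 4, 4, 3  # toy sizes
gru = GRUCell(router_dim, rng)
projections = [rand_matrix(router_dim, hidden_dim, rng) for _ in range(num_layers)]
gates = [rand_matrix(num_experts, router_dim, rng) for _ in range(num_layers)]
# Stand-ins for one token's hidden state at each layer.
layer_inputs = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(num_layers)]
decisions = recurrent_routing(layer_inputs, projections, gates, gru, router_dim)
for i, (experts, probs) in enumerate(decisions, start=1):
    print(f"layer {i}: selected experts {experts}")
```

The loop runs over layers rather than over sequence positions, so each token carries its own recurrent state and all tokens in a batch can be processed in parallel, which is consistent with the abstract's claim about efficient computation.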

Paper Structure (43 sections, 6 equations, 14 figures, 12 tables)

Figures (14)

  • Figure 1: Recurrent router for Mixture-of-Experts. In the $i$-th layer, the hidden state $\mathbf{x}_i$ is (I) projected to $\mathbf{x}^\prime$ with a lower hidden dimension (Eq. \ref{eq:RMoE-1}), then (II) combined with the previous layer's GRU output $\mathbf{h}_{i-1}$ and processed through the cross-layer-shared GRU to produce the current layer's GRU output $\mathbf{h}_i$ (Eq. \ref{eq:RMoE-hidden-state}). (III) Layer $i$'s router then uses this output to select experts and executes the standard MoE computation (Eq. \ref{eq:RMoE-3}). This operation does not introduce sequence-level recurrence and can be implemented efficiently, as shown in Tab. \ref{tab:main_results} and Tab. \ref{tab:megatron-main}.
  • Figure 2: Test BPC on Enwiki8 with different model sizes (6, 12, 18, 24, 32). Similar validation results are in App. \ref{Add-results}, Fig. \ref{fig:results-layers-val}.
  • Figure 3: Heat maps of cross-layer mutual information (MI) for different methods. The value at row $i$, column $j$ is the MI between layers $i$ and $j$. First row ((a) SMoE, (b) XMoE, (c) HyperMoE): all three methods have low cross-layer MI. Second row ((d) RMoE, (e) RMoE-NP, (f) RMoE-NP-r1.0): RMoE has high cross-layer MI, whereas MI drops sharply when the passing of layerwise recurrent states is disabled.
  • Figure 4: Gate score entropy distribution over the Enwiki8 test set for different router configurations. Similar results can be found in App. \ref{app:more-router-entropy}, Fig. \ref{fig:more-entropy-dist-8-layer} and Fig. \ref{fig:more-entropy-dist-8-layer-abl}.
  • Figure 5: Expert similarity distribution across layers during large-scale pre-training. We draw box plots of expert similarity from checkpoints taken every 1k training steps (approximately 4B tokens), showing expert similarity across the 24 layers of the model (with maximum, minimum, first quartile, median, and mean).
  • ...and 9 more figures
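
The quantities plotted in Figures 3 and 4 can be illustrated with a small sketch. The captions do not say how the paper estimates them, so the code below is an assumption: gate score entropy is taken as the Shannon entropy (in nats) of one token's gate distribution, and cross-layer MI is estimated from empirical counts of the top-1 expert chosen for each token at two layers. The function names `gate_entropy` and `mutual_information` are hypothetical.

```python
import math
from collections import Counter

def gate_entropy(probs):
    """Shannon entropy (nats) of one token's gate score distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(experts_a, experts_b):
    """Empirical MI (nats) between expert choices at two layers.

    experts_a[t] and experts_b[t] are the experts chosen for token t
    at the two layers being compared.
    """
    n = len(experts_a)
    joint = Counter(zip(experts_a, experts_b))
    count_a = Counter(experts_a)
    count_b = Counter(experts_b)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written in terms of counts.
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi

# A uniform gate has maximal entropy; a one-hot gate has zero entropy.
uniform_entropy = gate_entropy([0.25, 0.25, 0.25, 0.25])
one_hot_entropy = gate_entropy([1.0, 0.0, 0.0, 0.0])

# Layer B's choice is fully determined by layer A's choice: high MI.
dependent_mi = mutual_information([0, 1, 0, 1], [1, 0, 1, 0])
# Layer B's choice is unrelated to layer A's choice: zero MI.
independent_mi = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this reading, routers that decide independently at each layer would tend toward the low-MI case, while routing conditioned on earlier layers would tend toward the high-MI case.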