AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models

Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, Zhijie Deng

TL;DR

AdaMoE introduces token-adaptive routing for mixture-of-experts language models by augmenting the expert set with null experts, allowing tokens to engage a variable number of true and null experts. The method increases the router's top-k capacity and applies a null-aware load-balancing loss to control average null usage, enabling flexible compute without sacrificing autoregressive modeling. Empirical results show notable FLOPs reductions (around 14–15%) and improved or competitive accuracy on both regular LLMs (Llama2-7B) with LoRA and MoE-LLMs (Mixtral-8x7B), across a range of tasks, with ablations validating design choices. AdaMoE is easy to integrate with pre-trained models and offers a practical path to more efficient, scalable MoE-based systems. Future work includes pre-training MoE-LLMs with AdaMoE, exploring identity mappings for null experts, and deeper analysis of load-balancing dynamics.

Abstract

Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overhead. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because different tokens (e.g., "<EOS>" vs. "apple") may require different numbers of experts for feature abstraction. Lifting this constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. To this end, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select different numbers of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing -- it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and increases the value of k. AdaMoE does not force each token to occupy a fixed number of null experts but instead controls the average usage of the null experts with a load-balancing loss, leading to an adaptive number of null/true experts used by each token. AdaMoE closely resembles MoEs with expert choice routing while still allowing straightforward auto-regressive modeling. AdaMoE is easy to implement and can be effectively applied to pre-trained (MoE-)LLMs. Extensive studies show that AdaMoE can reduce average expert load (FLOPs) while achieving superior performance. For example, on the ARC-C dataset, applying our method to fine-tuning Mixtral-8x7B can reduce FLOPs by 14.5% while increasing accuracy by 1.69%.
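
The routing idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The expert counts (4 true, 3 null), k = 4, the logits, and the names `softmax` and `route_token` are hypothetical, chosen only for illustration. Renormalizing the weights over the selected true experts is also an assumption of this sketch, and the load-balancing loss is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, num_true, k):
    """Top-k routing over true + null experts for a single token.

    Indices 0..num_true-1 are true experts; the remaining indices are
    null experts. Returns {true_expert_index: weight}. Null experts are
    dropped from the result because they perform no computation.
    """
    probs = softmax(logits)
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    true_ids = [i for i in top_k if i < num_true]
    if not true_ids:
        # Only null experts were selected: no expert computation for this token.
        return {}
    # Assumption of this sketch: renormalize over the selected true experts.
    norm = sum(probs[i] for i in true_ids)
    return {i: probs[i] / norm for i in true_ids}

NUM_TRUE, K = 4, 4  # hypothetical: 4 true experts, 3 null experts, top-4 routing

# Hypothetical router logits: first 4 entries are true experts, last 3 are null.
token_1 = [2.0, 1.5, 1.0, -1.0, 0.5, -0.5, -2.0]
token_2 = [3.0, -1.0, -2.0, -3.0, 1.0, 0.5, 0.0]

routes = [route_token(t, NUM_TRUE, K) for t in (token_1, token_2)]
loads = [len(r) for r in routes]          # true experts used per token
average_load = sum(loads) / len(loads)    # average true-expert load
print(loads, average_load)
```

With these logits the first token keeps three true experts and the second keeps one, so the average load is two true experts per token even though each token's load differs.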

Paper Structure

This paper contains 22 sections, 7 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: The number of selected experts for various tokens in an AdaMoE variant of Mixtral-8x7B. After applying AdaMoE, the model can perform token-adaptive routing. Some tokens require only 1 expert for feature abstraction, which offers an opportunity for inference acceleration.
  • Figure 2: Comparison of routing mechanisms: vanilla MoE vs. AdaMoE. Left: In vanilla MoE, each token selects the top 2 experts based on the routing probabilities. Right: AdaMoE introduces an additional set of null experts and makes each token select the top 4 experts, which can include both true and null experts. For example, token 1 selects three true experts, while token 2 selects only one true expert. Despite this variation, the average number of true experts selected per token remains two, maintaining parity with the vanilla method.
  • Figure 3: Distribution, over tokens in the SocialIQA dataset, of the number of top experts needed for the cumulative routing probability to exceed 50%. Each bar shows the proportion of tokens at each count for the corresponding MoE layer in Mixtral-8x7B.
  • Figure 4: Left: Adding null experts to Mo-LoRA. Right: Adding null experts to the MoE layer of MoE-LLMs.
  • Figure 5: Performance comparison across five datasets: RTE, COLA, SQA, CQA, and OQA. The baseline is Llama2-7B fine-tuned with the vanilla Mo-LoRA method using top-1/top-2 routing. Acc. represents accuracy, and Load represents the average number of experts used per Mo-LoRA module or AdaMoE layer. AdaMoE uses different configurations: m5k2 (5 null experts, top-2 selection), m9k4, m7k4, and m5k4. AdaMoE achieves higher accuracy than the baseline on almost all datasets. The exact accuracy values can be found in the table referenced as tab:add_exp_1.
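
The statistic described in the Figure 3 caption can be sketched as follows. This is only an illustration of the counting rule, not the authors' analysis code. The function names and the three example probability vectors are hypothetical; real inputs would be per-token routing probabilities taken from one MoE layer.

```python
from collections import Counter

def experts_to_exceed(probs, threshold=0.5):
    """Smallest number of top experts whose cumulative probability exceeds threshold."""
    total = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        total += p
        if total > threshold:
            return count
    return len(probs)  # threshold never exceeded: all experts are needed

def count_proportions(per_token_probs, threshold=0.5):
    """Proportion of tokens at each expert count (one bar of the figure)."""
    counts = Counter(experts_to_exceed(p, threshold) for p in per_token_probs)
    n = len(per_token_probs)
    return {c: counts[c] / n for c in sorted(counts)}

# Hypothetical routing probabilities for three tokens over 8 experts.
peaked = [0.6, 0.2, 0.1, 0.05, 0.03, 0.01, 0.005, 0.005]
mixed = [0.3, 0.25, 0.2, 0.1, 0.05, 0.05, 0.03, 0.02]
uniform = [0.125] * 8

proportions = count_proportions([peaked, mixed, uniform])
print(proportions)
```

A sharply peaked distribution needs one expert to pass 50%, while a uniform distribution over 8 experts needs five, which is the kind of per-token variation that motivates adaptive routing.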