Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

TL;DR

This work introduces Efficiency Leverage (EL) to quantify the compute efficiency of mixture-of-experts (MoE) models relative to dense Transformers. Through a large-scale study of over 300 MoE configurations up to 28B parameters, the authors show that EL is primarily driven by the expert activation ratio and the total compute budget, with expert granularity acting as a non-linear modulator that has an optimal range around 8–12. They derive separable and joint scaling laws predicting EL as a function of activation ratio, granularity, and compute, and validate these laws by training Ling-mini-beta (0.85B active parameters, 17.5B total) to match the performance of a 6.1B dense model on the same 1T-token dataset while using over 7x less compute. The results provide a principled design space for efficient MoE architectures and demonstrate that carefully tuned MoE models can reach comparable performance with substantially fewer active parameters and lower training cost.

Abstract

Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for the Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically grounded foundation for the scaling of efficient MoE models.
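
As a rough sanity check on the "over 7x" figure, the sketch below compares approximate training compute for the two models named in the abstract. It assumes the common rule of thumb C ≈ 6 · N_active · D, which is not stated in this summary and ignores attention and routing overheads; the function name `approx_training_flops` is hypothetical. It is a back-of-the-envelope illustration, not the paper's EL definition or fitted scaling law.

```python
# Back-of-the-envelope check of the "over 7x" compute claim.
# Assumption (not from the source): training compute is approximated as
# C ~ 6 * N_active * D, ignoring attention and routing overheads.


def approx_training_flops(active_params: float, tokens: float) -> float:
    """Approximate training FLOPs under the 6 * N * D rule of thumb."""
    return 6.0 * active_params * tokens


TOKENS = 1e12          # 1T tokens, as stated in the abstract
MOE_ACTIVE = 0.85e9    # Ling-mini-beta active parameters
DENSE_PARAMS = 6.1e9   # dense baseline parameters

moe_flops = approx_training_flops(MOE_ACTIVE, TOKENS)
dense_flops = approx_training_flops(DENSE_PARAMS, TOKENS)

# With equal token counts, the ratio reduces to dense params / active params.
compute_ratio = dense_flops / moe_flops

print(f"MoE   ~ {moe_flops:.2e} FLOPs")
print(f"Dense ~ {dense_flops:.2e} FLOPs")
print(f"Dense / MoE compute ratio ~ {compute_ratio:.2f}")
```

Because both models see the same 1T tokens, the ratio under this approximation is simply 6.1B / 0.85B, which lands a little above 7 and is consistent with the abstract's claim when the two models reach matched performance.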

Paper Structure

This paper contains 57 sections, 16 equations, 14 figures, and 11 tables.

Figures (14)

  • Figure 1: Illustration of the definition of Efficiency Leverage (EL) and its estimated values using Eq. \ref{eq:all} for $10^{22}$ FLOPs.
  • Figure 2: Scaling laws for optimal hyperparameters. Blue and red lines represent the fitted laws for MoE and dense models, respectively, derived on the same training dataset. Gray circles are the experimental data points used for fitting.
  • Figure 3: Validation of MoE hyperparameters scaling laws across different activation ratios ($A$). "Near-optimal" refers to hyperparameters achieving a loss within 0.25% of the optimal ones.
  • Figure 4: Scaling laws for optimal model scale ($M^{\text{opt}}$) and data size ($D^{\text{opt}}$) on identical datasets. For a given budget, MoE models (blue) optimally allocate more resources to data and fewer to model size compared to dense models (red).
  • Figure 5: Impact of the Activation Ratio $A$ on Loss and Efficiency. (a) At any fixed compute budget (each colored line), lower activation ratios yield lower loss. The orange stars mark the optimal (lowest) loss point. (b) Loss and EL scaling curves illustrate that EL increases with both higher compute budgets and lower activation ratios, showing that MoE advantages are magnified at scale.
  • ...and 9 more figures

Theorems & Definitions (1)

  • Definition 3.1: Efficiency Leverage