Toward Inference-optimal Mixture-of-Expert Large Language Models

Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang

TL;DR

The paper extends the Transformer scaling law to mixture-of-experts LLMs by incorporating the number of experts $E$ and the training dataset size $D$, revealing diminishing returns and a saturation point $E_{max}$. It then enforces budget-aware optimization by introducing inference cost as a constraint, showing that 4–8 expert MoEs offer efficient serving but require more training, while over-trained smaller MoEs with more data can achieve comparable or better performance at lower inference cost. The authors formulate a practical framework for estimating MoE inference cost and demonstrate actionable trade-offs between training and serving, including strategies for bounding loss or inference cost. Collectively, the work provides guidance for deploying MoE LLMs under real-world budgets, highlighting the potential of over-training to achieve inference-efficiency without sacrificing quality.

Abstract

Mixture-of-Expert (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth of training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation between model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between model performance, model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe diminishing returns from increasing the number of experts. Taken alone, this suggests scaling the number of experts until saturation, since the training cost would remain constant, but doing so is problematic at inference time. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss. We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance, but cost 2.5-3.5x more to train. On the other hand, training a (16/32) expert MoE much smaller (70-85%) than the loss-optimal solution, but with a larger training dataset, is a promising setup under a training budget.
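To make the budget question in the abstract concrete, the following is a minimal sketch of how shrinking a model frees up tokens under a fixed training budget. It assumes the common rule of thumb $C \approx 6ND$ from the dense-transformer scaling literature, with $N$ taken as the parameters active per token; this approximation is not stated in the text above, and the paper's own cost model may differ. The function name `tokens_for_budget`, the base size, and the budget are hypothetical values chosen only for illustration. The ratios 0.85 and 0.70 reflect one reading of "70-85%" as the fraction of the loss-optimal size.

```python
def tokens_for_budget(train_flops: float, active_params: float) -> float:
    """Tokens D affordable under the assumed approximation C ~= 6 * N * D."""
    return train_flops / (6.0 * active_params)


# Hypothetical numbers, for illustration only.
base_params = 1.0e9                     # assumed loss-optimal size N
base_tokens = 20.0e9                    # assumed loss-optimal token count D
budget = 6.0 * base_params * base_tokens  # fixed training budget C

for ratio in (1.0, 0.85, 0.70):
    params = ratio * base_params
    tokens = tokens_for_budget(budget, params)
    # A smaller model leaves room for proportionally more training tokens.
    print(f"size ratio {ratio:.2f}: {tokens / base_tokens:.2f}x the base tokens")
```

Under this assumption the token count scales as the inverse of the size ratio, which is the sense in which a smaller model is "over-trained" at the same budget. The sketch says nothing about the resulting loss, which depends on the fitted scaling law.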

Paper Structure (36 sections, 13 equations, 8 figures, 1 table, 2 algorithms)

Figures (8)

  • Figure 1: Validation losses for different $D$. Scattered dots show the actual losses, and dotted lines correspond to values fitted by \ref{eq:moe_scaling_law}.
  • Figure 2: MoE inference cost. Cost increases proportionally with model size.
  • Figure 3: Trade-off between inference cost, model performance, and training cost. Inference cost and model performance for MoE models under different training budgets (left); model performance with different training FLOPs (middle); inference cost with different training FLOPs (right). Under the same budget, more experts mean better quality but higher inference cost. Fewer experts can reach a lower inference cost at the same quality, but need many more training FLOPs.
  • Figure 4: Loss-cost curve for a given training budget. The over-trained 16-expert model achieves both better performance and lower inference cost than the loss-optimal 4/8-expert models.
  • Figure 5: Optimal inference cost for a bounded loss. Minimum achievable inference cost with a bounded loss (left). Ratio of model size to the base model (right).
  • ...and 3 more figures