Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient

Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, Sebastian Jaszczur

TL;DR

This work addresses whether Mixture of Experts (MoE) can be memory-efficient under fixed hardware budgets by deriving a joint scaling law that ties the final loss to the number of active parameters, dataset size, and the number of experts. The authors propose the law $L(N_{\text{act}}, D, \hat{E}) = a \hat{E}^{\delta} N_{\text{act}}^{\alpha + \gamma \ln(\hat{E})} + b \hat{E}^{\omega} D^{\beta + \zeta \ln(\hat{E})} + c$ and introduce the transformed expert count $\hat{E}$ to stabilize fitting, validating it across more than 280 experiments up to 2.7B active parameters and 5B total parameters. They show that MoE can outperform dense models under the same compute or memory budgets and provide practical rules for selecting the number of experts and token budgets under memory constraints, including compute-, memory-, and inference-oriented optima. The findings imply that MoE can achieve lower loss and higher inference performance while reducing FLOPs per token, offering a principled approach to memory-aware MoE deployment in large-scale training.
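
As a rough illustration of how the law quoted above is evaluated, the following Python sketch computes the predicted loss from the three inputs. The function name `joint_moe_loss`, the dictionary layout, and every coefficient value are assumptions made for this example; they are placeholders, not the fitted values from the paper. The transformed expert count is taken as an input, since its definition is not given in this summary.

```python
import math


def joint_moe_loss(n_act, d, e_hat, coef):
    """Evaluate a*E^delta*N^(alpha+gamma*ln E) + b*E^omega*D^(beta+zeta*ln E) + c.

    n_act: number of active parameters
    d:     dataset size in tokens
    e_hat: transformed expert count (supplied by the caller)
    coef:  dict with keys a, b, c, alpha, beta, gamma, delta, omega, zeta
    """
    log_e = math.log(e_hat)
    # Parameter term: both the prefactor and the exponent depend on e_hat.
    param_term = (
        coef["a"]
        * e_hat ** coef["delta"]
        * n_act ** (coef["alpha"] + coef["gamma"] * log_e)
    )
    # Data term: same structure, with its own prefactor and exponent.
    data_term = (
        coef["b"]
        * e_hat ** coef["omega"]
        * d ** (coef["beta"] + coef["zeta"] * log_e)
    )
    return param_term + data_term + coef["c"]


# Hypothetical placeholder coefficients, chosen only so the code runs.
# They are NOT the values fitted in the paper.
PLACEHOLDER_COEF = {
    "a": 30.0, "alpha": -0.10, "gamma": 0.01, "delta": 0.05,
    "b": 30.0, "beta": -0.10, "zeta": -0.01, "omega": 0.05,
    "c": 1.5,
}

if __name__ == "__main__":
    loss = joint_moe_loss(n_act=1e9, d=2e10, e_hat=4.0, coef=PLACEHOLDER_COEF)
    print(f"predicted loss with placeholder coefficients: {loss:.4f}")
```

Note that at a transformed expert count of 1 the logarithm vanishes and both prefactors equal 1, so the expression reduces to $a N_{\text{act}}^{\alpha} + b D^{\beta} + c$.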

Abstract

Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.

Paper Structure

This paper contains 25 sections, 16 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (a) The loss of memory-constrained models predicted using our scaling law under a fixed training budget of $10^{22}$ FLOPs. Each curve represents a different number of experts. The lines are truncated at compute-optimal points since undertrained models are both larger and worse in terms of loss, thus pointless in a memory-constrained scenario. Shaded areas indicate the memory-optimal number of experts for the corresponding memory budgets. (b) Experimental validation of the thesis that MoE can be memory-optimal. The marked area shows an interval in which training a compute-matched MoE achieves better loss than an overtrained dense model with the same number of total parameters ($1.1$B). Such an MoE is trained for longer and has fewer active parameters, making it more practical for inference.
  • Figure 2: (a) IsoFLOP profiles for selected training budgets, with compute-optimal points marked for each curve. (b) FLOP savings from switching from a compute-optimal dense model to a compute-optimal MoE. For instance, 40% savings at $10^{20}$ FLOPs mean that an MoE matching the performance of a compute-optimal dense model trained with $10^{20}$ FLOPs can be trained with just $6 \times 10^{19}$ FLOPs (60% of the dense model's budget). The advantage of using MoE increases with larger models and expert counts.
  • Figure 3: Predicted loss for various numbers of experts at a FLOPs budget $F= 5 \times 10^{22}$. The x-axis represents the size of the model in terms of the number of parameters (a) or the total memory budget for both model parameters and KV cache for $8192$ tokens (b, c). Shaded areas indicate the optimal number of experts for the corresponding parameter or memory budget. (c) In addition to the KV cache, the inference cost on $100$B tokens is included in the FLOPs budget of $F= 5 \times 10^{22}$.
  • Figure 4: Investigation of the optimal number of experts for three different model sizes: $2$B, $5$B, and $10$B; and in three different scenarios from left to right: simply measuring the model size, including the size of a KV-cache with 32k tokens, and including the inference cost of processing 100B tokens. Note that in the second graph, the memory constraint corresponds to the memory requirements of dense models with sizes $2$B, $5$B, and $10$B, including the KV cache, while utilizing bfloat16 for both parameters and activations.
  • Figure 5: (a) Quality of the fit. The maximum absolute error on the held-out extrapolation set is $0.018$. (b) Predicted loss compared to observed loss for $E=1$. (c) Predicted loss (dashed line) compared to observed loss for $E=4$. We can see that on the training dataset, the error increases in an undertrained setting ($D/N<1$, i.e., fewer tokens than parameters). However, this scenario is never practical from our perspective.
  • ...and 3 more figures
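
The FLOP-savings arithmetic in the Figure 2 caption can be checked with a short Python sketch. The helper names `matched_moe_budget` and `savings_fraction` are assumptions introduced for this example; only the 40% and $10^{20}$ FLOPs figures come from the caption.

```python
def matched_moe_budget(dense_flops, savings):
    """Training FLOPs for an MoE that matches a dense model at `dense_flops`.

    `savings` is the fractional FLOP saving (0.40 means 40%).
    """
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return dense_flops * (1.0 - savings)


def savings_fraction(dense_flops, moe_flops):
    """Inverse direction: fractional saving implied by two matched budgets."""
    if dense_flops <= 0 or moe_flops < 0:
        raise ValueError("budgets must be positive")
    return 1.0 - moe_flops / dense_flops


if __name__ == "__main__":
    # Numbers from the Figure 2 caption: 40% savings at 1e20 FLOPs.
    budget = matched_moe_budget(1e20, 0.40)
    print(f"MoE budget: {budget:.1e} FLOPs")
    print(f"implied savings: {savings_fraction(1e20, budget):.0%}")
```

The two helpers are inverses of each other, so either the savings percentage or the matched MoE budget can be recovered from the other.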