Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient

Jan Ludziejewski; Maciej Pióro; Jakub Krajewski; Maciej Stefaniak; Michał Krutul; Jan Małaśnicki; Marek Cygan; Piotr Sankowski; Kamil Adamczewski; Piotr Miłoś; Sebastian Jaszczur

TL;DR

This work asks whether Mixture of Experts (MoE) models can be memory-efficient under fixed hardware budgets. It derives a joint scaling law that ties the final loss to the number of active parameters $N_{\text{act}}$, the dataset size $D$, and the number of experts. The proposed law is $L(N_{\text{act}}, D, \hat{E}) = a \hat{E}^{\delta} N_{\text{act}}^{\alpha + \gamma \ln \hat{E}} + b \hat{E}^{\omega} D^{\beta + \zeta \ln \hat{E}} + c$, where $\hat{E}$ is a transformed expert count introduced to stabilize fitting. The law is validated on more than 280 experiments with up to 2.7B active parameters and 5B total parameters. The authors show that MoE can outperform dense models under the same compute or memory budget, and they give practical rules for choosing the number of experts and the token budget under memory constraints, covering compute-, memory-, and inference-oriented optima. The findings imply that MoE can reach lower loss and better inference performance while using fewer FLOPs per token, which gives a principled basis for memory-aware MoE deployment in large-scale training.
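To make the functional form of the law concrete, the sketch below evaluates it for a given configuration. It is only an illustration: the function name `joint_moe_loss` and the `PLACEHOLDER_COEFFS` values are assumptions made up for this example, not the fitted coefficients from the paper, and the transformed expert count $\hat{E}$ is taken as an input because its definition is not given in this summary.

```python
import math


def joint_moe_loss(n_act, d, e_hat, coef):
    """Evaluate a*E^delta*N^(alpha + gamma*ln E) + b*E^omega*D^(beta + zeta*ln E) + c.

    n_act : number of active parameters (N_act)
    d     : dataset size in tokens (D)
    e_hat : transformed expert count (E_hat), supplied by the caller
    coef  : dict with keys a, alpha, delta, gamma, b, beta, omega, zeta, c
    """
    log_e = math.log(e_hat)
    # Model-size term: both the prefactor and the exponent depend on E_hat.
    model_term = (coef["a"] * e_hat ** coef["delta"]
                  * n_act ** (coef["alpha"] + coef["gamma"] * log_e))
    # Data term: same structure, with its own expert-dependent exponent.
    data_term = (coef["b"] * e_hat ** coef["omega"]
                 * d ** (coef["beta"] + coef["zeta"] * log_e))
    return model_term + data_term + coef["c"]


# HYPOTHETICAL coefficients, chosen only so that the exponents are negative.
# They are NOT the values fitted in the paper.
PLACEHOLDER_COEFFS = {
    "a": 30.0, "alpha": -0.15, "delta": 0.1, "gamma": -0.01,
    "b": 40.0, "beta": -0.15, "omega": -0.05, "zeta": 0.005,
    "c": 1.5,
}

if __name__ == "__main__":
    for e_hat in (1.0, 4.0, 16.0):
        loss = joint_moe_loss(1e9, 2e10, e_hat, PLACEHOLDER_COEFFS)
        print(f"E_hat={e_hat:>4}: predicted loss {loss:.4f}")
```

One property is visible directly from the formula: at $\hat{E} = 1$ the logarithm vanishes and the expression reduces to $a N_{\text{act}}^{\alpha} + b D^{\beta} + c$, so the expert-dependent terms act as corrections to a two-term power law in model size and data.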

Abstract

Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. To derive and validate the theoretical predictions of our scaling laws, we conduct over 280 experiments with up to 2.7B active parameters and up to 5B total parameters. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.
