Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization

Weilin Wan; Jingtao Han; Weizhong Zhang; Cheng Jin

Abstract

Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.

Paper Structure (45 sections, 8 equations, 12 figures, 21 tables)

Introduction
Related Work
Scaling laws for language models
Scaling laws for Mixture-of-Experts (MoE)
Parameter allocation in scaling laws
Preliminaries and general setup
Notations and definitions
Defining Model Scale via Computation
Scaling laws for optimal MoE resource allocation
Common Experimental Infrastructure
Decoupling and reducing MoE scaling dimensions
Why $(M, N_a, N)$ for MoE scaling
Reduction of scaling dimensions
Leveraging classical scaling laws for initial reduction
Algebraic reduction to four degrees of freedom
...and 30 more sections

Figures (12)

Figure 1: Overview of the dimension reduction pipeline. Starting from the full 16-dimensional MoE architectural space, we systematically reduce the search complexity to two sequential phases of $\mathcal{O}(n^3)$ and $\mathcal{O}(n^2)$ through classical scaling laws, algebraic constraints, engineering fixations, and the rank-preserving property of $d$.
Figure 2: Scatter plot of the median proxy (median feasible $d$ proxy) against the true minimum loss under a fixed budget, with a least-squares linear fit.
Figure 3: Visualization of the mathematical feasible region for the model design space on the $(N_a, N)$ plane with $M=5.9390$ GFLOPs. (a) shows the minimum available hidden dimension $d$ for each $(N_a, N)$ point, (b) shows the maximum hidden dimension $d$, and (c) illustrates the range width of feasible $d$ values.
Figure 4: Results visualized across different parameter spaces. (a) depicts the loss landscape in $(N_a, N)$ for a specific compute budget ($C=10^{20}$, $M=3.2681$ GFLOPs); additional visualizations for different compute scales are available in Appendix \ref{app:vis_3d}. (b) transforms this into the $(M/N_a, N/N_a)$ space, highlighting a black profile curve within the optimal loss region. (c) presents normalized loss profiles along $N/N_a$ for various compute scales, demonstrating the consistent flatness of the loss landscape near its optimal value.
Figure 5: Analysis of $M/N_a$ optimization. The shaded grey regions in both subfigures mark the "Near Optimal Bands": configurations whose loss is within 0.1% of the minimum at that compute scale, which indicates robustness to small deviations from the exact optimum. (a) Loss data points and fitted curves as a function of $M/N_a$ at several compute scales. (b) The derived scaling law for the optimal $M/N_a$ as a function of the compute budget $C$, showing how the preferred $M/N_a$ shifts with increasing compute while staying within the near-optimal band.
...and 7 more figures
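
The "Near Optimal Band" in the Figure 5 caption (loss within 0.1% of the minimum at a given scale) can be illustrated with a minimal sketch. The function name `near_optimal_band`, the parameter `rel_tol`, and the synthetic quadratic-in-log loss curve below are all assumptions made for illustration; they are not the paper's fitted curves, data, or code.

```python
import numpy as np

def near_optimal_band(x, loss, rel_tol=1e-3):
    """Return (x_lo, x_hi, x_best) for the points whose loss is within
    rel_tol (0.1% by default) of the minimum loss.

    Assumes strictly positive losses, so a relative tolerance is meaningful.
    If the qualifying points are not contiguous (e.g. noisy data), this
    reports the outer extent of the band.
    """
    x = np.asarray(x, dtype=float)
    loss = np.asarray(loss, dtype=float)
    best_idx = int(np.argmin(loss))
    within = loss <= loss[best_idx] * (1.0 + rel_tol)
    return float(x[within].min()), float(x[within].max()), float(x[best_idx])

# Hypothetical stand-in for a fitted loss curve over M/N_a at one compute
# scale: a parabola in log10(M/N_a) with its minimum at M/N_a = 10.
ratio = np.logspace(0.0, 2.0, 2001)
loss = 2.0 + 0.05 * (np.log10(ratio) - 1.0) ** 2

lo, hi, best = near_optimal_band(ratio, loss)
print(f"optimum at M/N_a = {best:.2f}, near-optimal band = [{lo:.2f}, {hi:.2f}]")
```

Evaluating the band on a dense grid of a fitted curve, rather than on raw training runs, keeps its edges from depending on where individual experiments happen to fall. Repeating the computation per compute scale would give one band per scale, which is the quantity whose width the abstract reports as growing.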
