Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Junzhuo Li; Peijie Jiang; Changxin Tian; Jia Liu; Zhiqiang Zhang; Xuming Hu

TL;DR

This paper generalizes the Chinchilla scaling law by incorporating the expert-attention compute ratio $r$ as an architectural parameter. The result is a framework for tuning MoE models beyond size and data, with practical guidelines for designing efficient MoE models under fixed compute budgets.

Abstract

This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^*$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^*$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.
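As a rough illustration of the quantity the abstract calls $r$, the sketch below counts per-token forward FLOPs for the expert and attention sub-layers of one hypothetical MoE layer and compares them. The FLOP-counting conventions (2 FLOPs per multiply-accumulate, an ungated two-matrix feed-forward expert, full-length attention) and all function and parameter names are assumptions for illustration, not the paper's accounting. The abstract also leaves open whether $r$ is normalized by attention FLOPs or by total FLOPs, so both readings are returned.

```python
def attention_flops_per_token(d_model: int, seq_len: int) -> int:
    """Approximate forward FLOPs per token for one attention sub-layer (assumed convention)."""
    # Q, K, V, and output projections: four d_model x d_model matmuls, 2 FLOPs per MAC.
    projections = 2 * 4 * d_model * d_model
    # Score computation (QK^T) and the weighted sum over values: ~seq_len * d_model MACs each.
    mixing = 2 * 2 * seq_len * d_model
    return projections + mixing


def expert_flops_per_token(d_model: int, d_ff: int, top_k: int) -> int:
    """Approximate forward FLOPs per token for the activated experts (assumed ungated FFN)."""
    # Each active expert applies an up-projection and a down-projection.
    return top_k * 2 * 2 * d_model * d_ff


def flops_ratios(d_model: int, seq_len: int, d_ff: int, top_k: int) -> tuple[float, float]:
    """Return (expert / attention, expert / (expert + attention)) per-token FLOPs."""
    f_attn = attention_flops_per_token(d_model, seq_len)
    f_exp = expert_flops_per_token(d_model, d_ff, top_k)
    return f_exp / f_attn, f_exp / (f_exp + f_attn)


# Hypothetical configuration, chosen only to make the arithmetic easy to check.
ratio_vs_attention, share_of_total = flops_ratios(d_model=512, seq_len=1024, d_ff=1024, top_k=2)
```

With this configuration both sub-layers cost the same under the assumed conventions, so the two readings of $r$ differ by a factor of two; which one the paper uses has to be taken from its formal definition.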

Paper Structure (46 sections, 26 equations, 6 figures, 3 tables)

Introduction
Theoretical Motivation
Compute Allocation in MoE Transformers
Diminishing Returns under Sparse Expert Activation
Implications for Optimal Allocation
Empirical Scaling Behavior of the Optimal Compute Allocation
Existence of a Scale-Dependent Optimal Ratio
Scaling of the Optimal Ratio with Compute
Sparsity-Dependent Scaling Coefficients
Summary: An Empirical Law for Optimal Allocation
Scaling Laws with Expert-Attention Trade-offs
From Allocation Law to Loss Scaling
Empirical Validation of the Extended Scaling Law
Practical Implications under Fixed Compute Budgets
Related Work
...and 31 more sections

Figures (6)

Figure 1: Loss as a function of FLOPs ratio and total compute. Black dashed lines trace optimal $r^*$. Color indicates active parameters. Low-sparsity models (left) favor higher $r^*$ at scale.
Figure 2: (a) Relationship between the optimal FLOPs ratio $r^*$ and total per‐token compute $C$, fitted with the power law $r^* = \alpha_r C^{\beta_r}$. (b) Dependence of the fitted coefficient $\alpha_r$ on the fraction of activated experts $(1-S)$, showing a clear power‐law trend. (c) Dependence of the fitted exponent $\beta_r$ on $(1-S)$, also following a power‐law. All axes are plotted on log–log scales.
Figure 3: Scaling law fit on data obtained from training compute-optimal models. (a) Fit on the data used to estimate the coefficients of Eq. \ref{eq:final}. (b) Validation of these coefficients on a held-out dataset. All data points with $S = 97.67\%$ were excluded from fitting and reserved for out-of-sample validation.
Figure 4: Comparison between the actual training curve of a model with 30M active parameters and 550M total parameters (95.38% sparsity) and the fitted curve derived from Eq. \ref{eq:final}.
Figure 5: Comparison of observed versus predicted validation losses for two alternative scaling formulas. (a) Predictions using the wang-etal-2024-scaling formulation (Eq. \ref{eq:wang}), which fails to generalize across modern high-sparsity settings. (b) Predictions using the wang-etal-2024-scaling formulation, which fits large models well but underperforms on smaller ones. The solid diagonal line indicates perfect agreement.
...and 1 more figure
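
The power-law form in the Figure 2 caption, $r^* = \alpha_r C^{\beta_r}$, is a straight line in log-log space, so its coefficients can be estimated by ordinary least squares on $(\log C, \log r^*)$. The following is a minimal sketch, assuming NumPy; the data are synthetic and generated from made-up coefficients, not values reported in the paper, and the function names are hypothetical.

```python
import numpy as np


def fit_power_law(compute, r_opt):
    """Fit r_opt = alpha * compute**beta by linear regression in log-log space."""
    log_c = np.log(np.asarray(compute, dtype=float))
    log_r = np.log(np.asarray(r_opt, dtype=float))
    # polyfit returns the highest-degree coefficient first: slope, then intercept.
    beta, log_alpha = np.polyfit(log_c, log_r, deg=1)
    return float(np.exp(log_alpha)), float(beta)


def predict_r_opt(compute, alpha, beta):
    """Evaluate the fitted power law at new compute budgets."""
    return alpha * np.asarray(compute, dtype=float) ** beta


# Synthetic, noise-free example with placeholder coefficients (assumptions, not paper values).
true_alpha, true_beta = 0.05, 0.2
compute = np.logspace(8, 11, num=7)          # hypothetical per-token compute budgets
r_opt = true_alpha * compute ** true_beta    # "observed" optimal ratios

alpha_hat, beta_hat = fit_power_law(compute, r_opt)
r_pred = predict_r_opt([1e12], alpha_hat, beta_hat)  # extrapolate to a larger budget
```

The same log-log regression could in principle be repeated on the fitted $\alpha_r$ and $\beta_r$ against $(1-S)$, since panels (b) and (c) of Figure 2 describe those dependencies as power laws as well.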
