A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

X. Y. Han, Yuan Zhong

TL;DR

This work develops a theoretical framework for Auxiliary-Loss-Free Load Balancing (ALF-LB) in sparse Mixture-of-Experts (s-MoE), casting ALF-LB as a one-step-per-iteration primal-dual method for an assignment problem. In a stylized deterministic setting, it establishes a monotonic improvement of a Lagrangian objective, a preference rule that moves tokens from overloaded to underloaded experts, and an approximate-balancing guarantee. It then extends the analysis to a stochastic online setting, deriving a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Experiments on 1B-parameter DeepSeekMoE models complement the theory and illustrate trade-offs between balancing efficiency and predictive performance across balancing schemes. Overall, the paper offers a principled lens for analyzing s-MoE load balancing in large AI models.

Abstract

In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
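As a rough illustration of the procedure being analyzed, the sketch below runs ALF-LB iterations in a stylized deterministic setting: tokens are routed to their top-$K$ experts by biased score, and each expert's bias is then nudged toward balance. The sign-based bias update follows our reading of Wang et al. (2024). The function names, the fixed score matrix `gamma`, the step size, and the convention that biases are added to the scores are assumptions made for illustration, not the paper's exact update steps.

```python
import numpy as np


def alf_lb_step(gamma, p, K, step_size):
    """One illustrative ALF-LB iteration (assumed form, see lead-in).

    gamma: (T, E) array of token-to-expert affinity scores.
    p:     (E,) array of per-expert routing biases.
    """
    T, E = gamma.shape
    target_load = K * T / E                  # balanced load L = KT/E
    biased = gamma + p                       # biases influence routing only
    # Indices of the K largest biased scores for every token.
    topk = np.argpartition(-biased, K - 1, axis=1)[:, :K]
    load = np.bincount(topk.ravel(), minlength=E)
    # Raise the bias of underloaded experts, lower it for overloaded ones.
    p_new = p + step_size * np.sign(target_load - load)
    return topk, load, p_new


def run_demo(T=512, E=8, K=2, step_size=0.005, iters=400, seed=0):
    """Compare the worst-case load deviation before and after balancing."""
    rng = np.random.default_rng(seed)
    # Hypothetical scores: later experts are systematically favored.
    gamma = rng.random((T, E)) + np.linspace(0.0, 0.5, E)
    p = np.zeros(E)
    target_load = K * T / E
    _, load, _ = alf_lb_step(gamma, p, K, step_size)
    initial_dev = float(np.abs(load - target_load).max())
    for _ in range(iters):
        _, load, p = alf_lb_step(gamma, p, K, step_size)
    final_dev = float(np.abs(load - target_load).max())
    return initial_dev, final_dev


if __name__ == "__main__":
    before, after = run_demo()
    print(f"max load deviation: {before:.0f} -> {after:.0f}")
```

Because the scores are held fixed here, the sketch only mirrors the deterministic part of the analysis; in actual training the scores change from batch to batch, which is what the online formulation is meant to capture.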

Paper Structure

This paper contains 28 sections, 12 theorems, 38 equations, 4 figures, 1 table.

Key Result

Theorem 1

(Change in Lagrangian) Using the procedure described in Steps eq:dualupdate-eq:primalupdate (with $K=1$), the following holds for the Lagrangian eq:L:

Figures (4)

  • Figure 1: Schematic of a naïve s-MoE layer without load balancing.
  • Figure 2: Validation set load imbalance and loss during the training of a 1B-parameter DeepSeekMoE model. Section \ref{sec:experiments} gives experiment details. Left: We measure the imbalance as the average load deviation from the target load $L=KT/E$ across all experts in the DeepSeekMoE-1B architecture. Right: We measure the loss on the validation set.
  • Figure 3: Time-lapse histograms of the marginal distributions of $\gamma^{(n)}_{ik}$ during the training of 1B-parameter DeepSeekMoE models using different choices of step-size (Section \ref{sec:primal_dual_alflb}). Experimental details in Section \ref{sec:experiments}.
  • Figure 4: Time-lapse histograms of the marginal distributions of the ALF-LB biases $p$ during the training of 1B-parameter DeepSeekMoE models using different choices of step-size (Section \ref{sec:primal_dual_alflb}). No explicit constraints were enforced on $p$. Section \ref{sec:experiments} provides experimental details.
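The imbalance measure in the Figure 2 caption can be sketched as follows. The caption does not say whether the deviation is absolute, squared, or normalized, so the mean absolute deviation used here is an assumption, and `mean_load_deviation` is a hypothetical helper name.

```python
import numpy as np


def mean_load_deviation(assignments, num_experts):
    """Average absolute deviation of expert loads from L = KT/E.

    assignments: (T, K) integer array; row i lists the K experts
                 chosen for token i.
    """
    T, K = assignments.shape
    target_load = K * T / num_experts
    load = np.bincount(assignments.ravel(), minlength=num_experts)
    return float(np.abs(load - target_load).mean())


# Four tokens, K = 2, E = 4, so the target load is L = 2.
# Loads are [3, 2, 2, 1], giving deviations [1, 0, 0, 1].
example = np.array([[0, 1], [0, 2], [0, 3], [1, 2]])
print(mean_load_deviation(example, num_experts=4))  # 0.5
```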

Theorems & Definitions (13)

  • Theorem 1
  • Theorem 2
  • Corollary 3
  • Proposition 4
  • Proposition 5
  • Corollary 6
  • Theorem 7
  • Proposition 8: Unbiased Stochastic Gradient
  • Proposition 9
  • Proposition 10: Second Directional Derivative
  • ...and 3 more