Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models
Yuan Sun
TL;DR
The paper tackles unbalanced expert loads during MoE pre-training by introducing BIP-Based Balancing, which adds a per-layer vector $\bm q$ and solves a binary integer programming (BIP) formulation to adjust the Top-K routing with minimal overhead. The optimization maximizes $\sum_{i=1}^n \sum_{j=1}^m s_{ij} x_{ij}$ subject to $\sum_{j=1}^m x_{ij} \le k$ and $\sum_{i=1}^n x_{ij} \le \frac{kn}{m}$, with $x_{ij} \in \{0,1\}$, and leverages LP duality and ADMM to update $p_i$ and $q_j$ so that the effective routing respects balance. Empirical results on Minimind MoE models with 16 and 64 experts show that BIP-Based Balancing reduces balance violations and achieves lower perplexities than the Loss-Controlled and Loss-Free methods, while saving at least 13% of pre-training time relative to the Loss-Controlled method. The work also discusses online and constant-space approximations, broadening applicability to online matching and recommender systems, and provides code for reproducibility.
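To make the formulation above concrete, the following minimal sketch solves it by exhaustive search on a toy instance and compares the result with plain Top-K routing. It is an illustration only, not the paper's solver (which relies on LP duality and ADMM); the function names, the toy scores, and the assumption that $m$ divides $kn$ are all hypothetical choices made for this example.

```python
from itertools import combinations, product


def topk_routing(scores, k):
    """Plain Top-K: every token picks its k highest-scoring experts."""
    routing = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        routing.append(sorted(ranked[:k]))
    return routing


def expert_loads(routing, m):
    """Number of tokens assigned to each of the m experts."""
    loads = [0] * m
    for experts in routing:
        for j in experts:
            loads[j] += 1
    return loads


def bip_routing_bruteforce(scores, k):
    """Exhaustively solve: maximize sum_ij s_ij * x_ij subject to
    sum_j x_ij <= k (per token) and sum_i x_ij <= k*n/m (per expert).
    Exponential in n, so usable only on toy inputs."""
    n, m = len(scores), len(scores[0])
    capacity = (k * n) // m  # assumes m divides k*n
    # Every subset of at most k experts that one token may select.
    per_token = [c for r in range(k + 1) for c in combinations(range(m), r)]
    best_value, best_routing = float("-inf"), None
    for assignment in product(per_token, repeat=n):
        routing = [list(c) for c in assignment]
        if max(expert_loads(routing, m)) > capacity:
            continue  # violates the per-expert load constraint
        value = sum(scores[i][j] for i, experts in enumerate(routing) for j in experts)
        if value > best_value:
            best_value, best_routing = value, routing
    return best_routing, best_value


# Toy instance (hypothetical): 4 tokens, 2 experts, k = 1; every token prefers expert 0.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
plain = topk_routing(scores, 1)
balanced, value = bip_routing_bruteforce(scores, 1)
print("Top-K loads:", expert_loads(plain, 2))
print("BIP loads:  ", expert_loads(balanced, 2))
```

On this toy input, plain Top-1 routing sends all four tokens to expert 0 (loads `[4, 0]`), whereas the constrained optimum moves the two tokens with the smallest score gap to expert 1 (loads `[2, 2]`).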
Abstract
One of the main issues in pre-training MoE (Mixture-of-Experts) models is unbalanced expert loads, which may cause routing collapse or increased computational overhead. Existing approaches include the Loss-Controlled method and the Loss-Free method; under both, the degree of imbalance remains high during the first several training steps and decreases only slowly. In this work, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector $\bm q$ on each MoE layer, which changes the top-K order of the routing scores $s$ by solving a binary integer program at very small time cost. We implement the algorithm on two MoE language models: a 16-expert model (0.3B) and a 64-expert model (1.1B). The experimental results show that, on both models, our algorithm yields the lowest perplexities compared with the Loss-Controlled method and the Loss-Free method, while saving at least 13% of pre-training time compared with the Loss-Controlled method. To the best of our knowledge, this is the first routing algorithm that maintains load balance on every expert in every MoE layer from the first step to the last step of pre-training, while the trained MoE models also perform well. The code for this work is available at https://github.com/sunyuanLLM/bip_routing_algorithm.
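As a rough illustration of how a per-layer vector can change the top-K order of the scores, the sketch below ranks experts by $s_{ij} - q_j$ instead of $s_{ij}$ and reports a simple imbalance measure. Several things here are assumptions rather than details taken from this work: that $\bm q$ enters by subtraction, the hand-picked values of $\bm q$ (they are not produced by the BIP-based update), the function names, and the imbalance metric (maximum load relative to the mean load).

```python
def route_with_q(scores, q, k):
    """Top-K on adjusted scores s_ij - q_j (assumed form of the adjustment)."""
    routing = []
    for row in scores:
        adjusted = [s_ij - q_j for s_ij, q_j in zip(row, q)]
        ranked = sorted(range(len(row)), key=lambda j: adjusted[j], reverse=True)
        routing.append(sorted(ranked[:k]))
    return routing


def max_violation(routing, m):
    """(max load - mean load) / mean load; 0.0 means perfectly balanced.
    An assumed metric for illustration, not necessarily the paper's."""
    loads = [0] * m
    for experts in routing:
        for j in experts:
            loads[j] += 1
    mean_load = sum(loads) / m
    return (max(loads) - mean_load) / mean_load


# Toy scores (hypothetical): 4 tokens, 2 experts, k = 1; every token prefers expert 0.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]

unbiased = route_with_q(scores, [0.0, 0.0], 1)   # plain Top-1
biased = route_with_q(scores, [0.45, 0.0], 1)    # q chosen by hand to penalize expert 0

print("violation without q:", max_violation(unbiased, 2))
print("violation with q:   ", max_violation(biased, 2))
```

With the hand-picked penalty on expert 0, the two tokens whose preference for expert 0 is weakest switch to expert 1, so the routing becomes balanced without altering the scores themselves.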
