Load Balancing Mixture of Experts with Similarity Preserving Routers

Nabil Omi, Siddhartha Sen, Ali Farhadi

TL;DR

This paper tackles expert under-utilization in MoE routing by introducing SimBal, a similarity-preserving router objective that softly enforces token-wise relationships through a Gram-matrix loss $\mathcal{L}_{\text{orth}} = \left\| R^{\top} R - I_E \right\|_1$. By encouraging near-orthogonal router weights through a soft penalty rather than a hard constraint, SimBal preserves pairwise token similarities and prevents expert collapse, leading to faster convergence and less redundancy than traditional load balancing losses. Empirical results on MoE-M and MoE-L show that SimBal reaches the same validation loss roughly 36% faster and improves perplexity. The paper also introduces a Pairwise Expert Similarity (PES) metric, which captures the reduced redundancy and better specialization. Inference-time pruning further reveals that SimBal enables greater throughput gains with minimal perplexity loss, underscoring its practical benefits for scalable MoE training and deployment.
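To make the Gram-matrix loss above concrete, here is a minimal NumPy sketch. It assumes that $R$ is a router weight matrix of shape `(d_model, n_experts)` and that $\|\cdot\|_1$ is the entry-wise L1 norm; neither detail is stated in this summary, and the function name `orthogonality_loss` is hypothetical.

```python
import numpy as np

def orthogonality_loss(R: np.ndarray) -> float:
    """Entry-wise L1 norm of R^T R - I_E.

    Assumption: R has shape (d_model, n_experts), so R^T R is the
    E x E Gram matrix of the per-expert router vectors.
    """
    n_experts = R.shape[1]
    gram = R.T @ R
    return float(np.abs(gram - np.eye(n_experts)).sum())

rng = np.random.default_rng(0)
d_model, n_experts = 16, 4

# A random router is generally far from orthonormal.
R_random = rng.normal(size=(d_model, n_experts))

# The reduced QR factor has orthonormal columns, so its loss is ~0.
R_orthonormal, _ = np.linalg.qr(R_random)

loss_random = orthogonality_loss(R_random)
loss_orthonormal = orthogonality_loss(R_orthonormal)
print(f"random router:      {loss_random:.4f}")
print(f"orthonormal router: {loss_orthonormal:.2e}")
```

The loss vanishes exactly when the router columns are orthonormal, which is the sense in which the penalty is "soft": it pulls the weights toward that configuration without enforcing it.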

Abstract

Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters ("experts") for each input. A learned router computes a distribution over these experts and assigns each input token to a small subset of them. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage a roughly uniform distribution over experts for each token. During training, this can cause inconsistent routing behavior, so the model spends its capacity learning redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experimental results show that applying our loss to the router results in 36% faster convergence and lower redundancy compared to a popular load balancing loss.
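The routing step described in the abstract can be sketched as follows. This is an illustrative example, not the paper's implementation: it assumes a linear router followed by a softmax and top-$k$ selection, and the names `route_top_k` and `expert_load` are hypothetical.

```python
import numpy as np

def route_top_k(tokens: np.ndarray, router_weights: np.ndarray, k: int = 2):
    """Return per-token expert probabilities and the indices of the top-k experts.

    Assumption: a linear router (tokens @ router_weights) followed by softmax.
    """
    logits = tokens @ router_weights                      # (n_tokens, n_experts)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    top_k = np.argsort(-probs, axis=-1)[:, :k]            # k most probable experts
    return probs, top_k

def expert_load(top_k: np.ndarray, n_experts: int) -> np.ndarray:
    """Fraction of routed token slots received by each expert."""
    counts = np.bincount(top_k.ravel(), minlength=n_experts)
    return counts / counts.sum()

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 16, 4
tokens = rng.normal(size=(n_tokens, d_model))
router_weights = rng.normal(size=(d_model, n_experts))

probs, top_k = route_top_k(tokens, router_weights, k=2)
load = expert_load(top_k, n_experts)
n_active = int(np.count_nonzero(load))
print("load per expert:", load)
print("unique experts used:", n_active)
```

A collapsed router would concentrate `load` on a few experts and leave the rest at zero; counting the nonzero entries gives the kind of "unique experts activated" measurement that the expert utilization figure reports.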

Paper Structure

This paper contains 20 sections, 10 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Validation loss curves for checkpoints during training. In both MoE-M and MoE-L, we achieve the same loss roughly 36% faster.
  • Figure 2: Expert utilization throughout training for MoE-M (left) and MoE-L (right), comparing LBL, our method (SimBal), and a baseline with no load balancing. We measure the number of unique experts activated on our full 77M-token validation set over time. Without any balancing, the expert routing collapses to a smaller set of experts. Both LBL and SimBal maintain full expert utilization and avoid expert collapse. The no-loss baseline was truncated early.
  • Figure 3: Analysis of expert redundancy in MoE-L models. (a) PES across different layers; our approach (blue) maintains significantly lower redundancy than LBL (orange). Darker = later in training. (b) Rate of change of PES during training, averaged over all layers. Redundancy occurs when many distinct experts see similar tokens, and is most likely to happen early in training, as we observe. We note that the rate of change is $>0$ at most points for LBL, suggesting it exacerbates redundancy during the majority of training.
  • Figure 4: Number of dropped top experts vs. validation loss, as proposed by Dai et al. (2024). SimBal exhibits lower redundancy, as shown by its worse performance as more experts are dropped.
  • Figure 5: Rate of change in minimum PES (over the layers of a model) over a training run, comparing LBL (higher perplexity) and SimBal (lower perplexity).
  • ...and 1 more figure