Grouter: Decoupling Routing from Representation for Accelerated MoE Training

Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan

TL;DR

Grouter is a preemptive routing method that distills high-quality routing structures from fully trained MoE models and serves as a fixed router for target models, significantly improving both the speed and the quality of model convergence.

Abstract

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to train expert weights while simultaneously searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully trained MoE models and serves as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly improves both the speed and the quality of model convergence. To ensure the framework's versatility, we also introduce expert folding, which adapts Grouter across varying model configurations, and expert tuning, which rebalances workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency, boosting pre-training data utilization by 4.28x and achieving up to 33.5% throughput acceleration, establishing preemptive routing as a fundamental paradigm for scalable MoE training.

Paper Structure (34 sections, 19 equations, 11 figures, 4 tables, 1 algorithm)


Figures (11)

  • Figure 1: (a) The percentage of tokens that keep an exactly identical set of $\mathbf{k}$ activated experts for the same input across adjacent checkpoints. More detailed descriptions and figures are given in the section "Detailed Description of Routing Fluctuation Heatmap". (b) Sensitivity analysis of expert specialization via random routing perturbations. Perturbations were applied at specific intervals with a fixed learning rate of $10^{-5}$. The resulting loss trajectories (lines) and average gradient norms (bars) reveal a clear trend: early-stage training (1k steps) is resilient to routing noise because experts have not yet specialized, whereas later stages exhibit severe instability and loss spikes under perturbation. This demonstrates that experts gradually develop deep specialization as the router stabilizes, making the model increasingly sensitive to routing errors. (c) The coefficient of variation of the gradient norm within a sliding window of length 100. The results show that Grouter significantly improves stability, whereas both the state-of-the-art load-balancing approach and pure static routing assignment lead to substantial gradient fluctuations.
  • Figure 2: (a) Overview of the Grouter workflow. Grouter first extracts a highly optimized structural prior from the Source Model, and then injects this prior into the Target Model in a frozen state. (b) Illustration of our Expert Tuning and Expert Folding techniques.
  • Figure 3: The workflow of our EP communication optimization strategy. Step 1: Sequences pass through Grouter to generate token-level assignments, which are aggregated into sequence-level routing affinity vectors. Step 2: Sequences are clustered based on affinity. Using cluster centroids as preference weights, we solve an optimization problem to assign experts to EP devices, maximizing the alignment between experts and sequence clusters. Step 3: With expert locations fixed, each sequence is assigned to the EP device that minimizes the resulting communication volume.
  • Figure 4: (a) Pre-training validation loss curves across 30B tokens. $\text{Grouter Raw}$ denotes the distilled $\text{Grouter}$ used without the subsequent Expert-Tuning. The shaded Gap region illustrates the loss difference between the $\text{Grouter}$ curve and the best baseline. Overall, $\text{Grouter}$ achieves a $4.28\times$ acceleration or a maximum loss reduction of $0.85$ at the same training data volume. (b) Comparison of the performance and load-balance trade-off. The third quadrant is highlighted in green to emphasize that models located within this region achieve an optimal balance between low load violation and superior model performance.
  • Figure 5: (a) Downstream task results: Grouter achieves an average improvement of 2.80 across six benchmarks, with gains of up to 10 points on specific tasks. This demonstrates that the reduced validation loss achieved by Grouter translates into genuine enhancements in model capabilities, rather than merely overfitting the validation metric. (b) Throughput scaling: We evaluate throughput on 1, 2, and 4 nodes with Expert Parallelism (EP) degrees set to 8, 16, and 32, respectively. For the multi-node setups (2 and 4 nodes), we apply node-granularity communication optimization, while for the single-node case, we utilize GPU-granularity optimization.
  • ...and 6 more figures
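
The quantities named in the captions of Figure 1(a), Figure 1(c), and Step 1 of Figure 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `routing_agreement`, `sliding_cv`, and `sequence_affinity` are hypothetical, the use of the population standard deviation is an assumption, and normalizing the affinity vector to sum to 1 is one plausible reading of "aggregated", not something the captions specify.

```python
from collections import Counter
from statistics import mean, pstdev


def routing_agreement(topk_a, topk_b):
    """Fraction of tokens whose set of activated experts is identical
    in two checkpoints (expert order within a token is ignored)."""
    if len(topk_a) != len(topk_b):
        raise ValueError("both checkpoints must route the same tokens")
    same = sum(set(a) == set(b) for a, b in zip(topk_a, topk_b))
    return same / len(topk_a)


def sliding_cv(grad_norms, window=100):
    """Coefficient of variation (std / mean) of each full sliding window.
    Assumption: population std; gradient norms are strictly positive."""
    out = []
    for end in range(window, len(grad_norms) + 1):
        chunk = grad_norms[end - window:end]
        out.append(pstdev(chunk) / mean(chunk))
    return out


def sequence_affinity(token_experts, num_experts):
    """Aggregate token-level expert assignments of one sequence into a
    normalized histogram over experts (assumed form of the affinity vector)."""
    counts = Counter(e for experts in token_experts for e in experts)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


# Toy data: 4 tokens, top-2 routing, expert ids are made up for illustration.
ckpt_a = [(0, 3), (1, 2), (4, 5), (0, 1)]
ckpt_b = [(3, 0), (1, 2), (4, 6), (1, 2)]

agreement = routing_agreement(ckpt_a, ckpt_b)    # 2 of 4 tokens match
cv = sliding_cv([1.0, 1.0, 3.0, 3.0], window=2)  # one value per full window
affinity = sequence_affinity(ckpt_a, num_experts=6)
```

A sequence-level vector such as `affinity` is the kind of input that the clustering in Step 2 of Figure 3 would consume; the clustering and the expert-placement optimization themselves are not sketched here because the captions do not describe their formulation.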